Abstract

Efficient collaborative decision making is an important challenge for multiagent systems. Finding optimal 
joint actions is especially challenging when each agent has only imperfect information about the state of its 
environment. Such problems can be modeled as collaborative Bayesian games in which each agent receives 
private information in the form of its type. However, representing and solving such games requires space 
and computation time exponential in the number of agents. This article introduces collaborative graphical 
Bayesian games (CGBGs), which facilitate more efficient collaborative decision making by decomposing the 
global payoff function as the sum of local payoff functions that depend on only a few agents. We propose a 
framework for the efficient solution of CGBGs based on the insight that they possess two different types of
independence, which we call agent independence and type independence. In particular, we present a factor
graph representation that captures both forms of independence and thus enables efficient solutions. In addition,
we show how this representation can provide leverage in sequential tasks by using it to construct a novel
method for decentralized partially observable Markov decision processes. Experimental results in both random
and benchmark tasks demonstrate the improved scalability of our methods compared to several existing
alternatives. 

keywords: reasoning under uncertainty, decision-theoretic planning, multiagent decision making, collaborative
Bayesian games, decentralized partially observable Markov decision processes



1 Introduction 



Collaborative multiagent systems are of significant scientific interest, not only because they can tackle inherently
distributed problems, but also because they facilitate the decomposition of problems too complex to be
tackled by a single agent (Huhns, 1987; Sycara, 1998; Panait and Luke, 2005; Vlassis, 2007; Busoniu et al.,
2008). As a result, a fundamental question in artificial intelligence is how best to design control systems for
collaborative multiagent systems. In other words, how should teams of agents act so as to most effectively
achieve common goals? When uncertainty and many agents are involved, this question is particularly challenging,
and has not yet been answered in a satisfactory way.

Figure 1: Illustration of multiagent decision making with imperfect information. Both agents are located near
house 2 and know that it is on fire. However, each agent receives only a noisy observation of the single 
neighboring house it can observe in the distance. 



A key challenge of collaborative multiagent decision making is the presence of imperfect information
(Harsanyi, 1967-1968; Kaelbling et al., 1998). Even in single-agent settings, agents may have incomplete
knowledge of the state of their environment, e.g., due to noisy sensors. However, in multiagent settings this
problem is often greatly exacerbated, as agents have access only to their own sensors, typically a small fraction 
of those of the complete system. In some cases, imperfect information can be overcome by sharing sensor 
readings. However, due to bandwidth limitations and synchronization issues, communication-based solutions 
are often brittle and scale poorly in the number of agents. 

As an example, consider the situation depicted in Fig. 1. After an emergency call by the owner of house
2, two firefighting agents are dispatched to fight the fire. While each agent knows there is a fire at house 2, 
the agents are not sure whether fire has spread to the neighboring houses. Each agent can potentially observe 
flames at one of the neighboring houses (agent 1 observes house 1 and agent 2 observes house 3) but neither 
has perfect information about the state of the houses. As a result, effective decision making is difficult. If agent 
1 observes flames in house 1, it may be tempted to fight fire there rather than at house 2. However, the efficacy 
of doing so depends on whether agent 2 will stay to fight fire in house 2, which in turn depends on whether 
agent 2 observes flames in house 3, a fact unknown to agent 1. 

Strategic games, the traditional models of game theory, are poorly suited to modeling such problems because 
they assume that there is only one state, which is known to all the agents. In contrast, in the example of
Fig. 1, each agent has only a partial view of the state, i.e., from each agent's individual perspective, multiple
states are possible. Such problems of multiagent decision making with imperfect information can be modeled
with Bayesian games (BGs) (Harsanyi, 1967-1968; Osborne and Rubinstein, 1994). In a BG, each agent has
a type that specifies what private information it holds. For example, an agent's type may correspond to an 
observation that it makes but the other agents do not. Before the agents select actions, their types are drawn 
from a distribution. Then, the payoffs they receive depend not only on the actions they choose, but also on their 
types. Problems in which the agents have a common goal can be modeled as collaborative Bayesian games 
(CBGs), in which all agents share a single global payoff function. Unfortunately, solving CBGs efficiently is 
difficult, as both the space needed to represent the payoff function and the computation time needed to find 
optimal joint actions scale exponentially with the number of agents. 

In this article, we introduce collaborative graphical Bayesian games (CGBGs), a new framework designed
to facilitate more efficient collaborative decision making with imperfect information. As in strategic games
(Guestrin et al., 2002; Kok and Vlassis, 2006), global payoff functions in Bayesian games can often be decomposed
as the sum of local payoff functions, each of which depends on the actions of only a few agents. We call
such games graphical because this decomposition can be expressed as an interaction hypergraph that specifies
which agents participate in which local payoff functions. 

Our main contribution is to demonstrate how this graphical structure can be exploited to solve CGBGs
more efficiently. Our approach is based on the critical insight that CGBGs contain two fundamentally different
types of independence. Like graphical strategic games, CGBGs possess agent independence: each local payoff
function depends on only a subset of the agents. However, we identify that CGBGs also possess type independence:
since only one type per agent is realized in a given game, the expected payoff decomposes as the sum of
contributions that depend on only a subset of types.

We propose a factor graph representation that captures both agent and type independence. Then, we
show how such a factor graph can be used to find optimal joint policies via nonserial dynamic programming
(Bertele and Brioschi, 1972; Guestrin et al., 2002). While this approach is faster than a naive alternative, we
prove that its computational complexity remains exponential in the number of types. However, we also show
how the same factor graph facilitates even more efficient, scalable computation of approximate solutions via
Max-Plus (Kok and Vlassis, 2006), a message-passing algorithm. In particular, we prove that each iteration
of Max-Plus is tractable for small local neighborhoods.

Alternative solution approaches for CGBGs can be found among existing techniques. For example, a
CGBG can be converted to a multiagent influence diagram (MAID) (Koller and Milch, 2003). However,
since the resulting MAID has a single strongly connected component, the divide-and-conquer technique proposed
by Koller and Milch reduces to brute-force search. Another approach is to convert CGBGs to non-collaborative
graphical strategic games, for which efficient solution algorithms exist (Vickrey and Koller, 2002;
Ortiz and Kearns, 2003; Daskalakis and Papadimitriou, 2006). However, the conversion process essentially
strips away the CGBG's type independence, resulting in an exponential increase in the worst-case size of
the payoff function. CGBGs can also be modeled as constraint optimization problems (Modi et al., 2005),
for which some methods implicitly exploit type independence (Oliehoek et al., 2010; Kumar and Zilberstein,
2010). However, these methods do not explicitly identify type independence and do not exploit agent independence.

Thus, the key advantage of the approach presented in this article is the simultaneous exploitation of both 
agent and type independence. We present a range of experimental results that demonstrate that this advantage 
leads to better scalability than several alternatives with respect to the number of agents, actions, and types. 

While CGBGs model an important class of collaborative decision-making problems, they apply only to 
one-shot settings, i.e., each agent needs to select only one action. However, CGBG solution methods can also 
provide substantial leverage in sequential tasks, in which agents take a series of actions over time. We illustrate 
the benefits of CGBGs in such settings by using them to construct a novel method for solving decentralized
partially observable Markov decision processes (Dec-POMDPs) (Bernstein et al., 2002). Our method extends
an existing approach in which each stage of the Dec-POMDP is modeled as a CBG. In particular, we show
how approximate inference and factored value functions can be used to reduce the problem to a set of CGBGs, 
which can be solved using our novel approach. Additional experiments in multiple Dec-POMDP benchmark 
tasks demonstrate better scalability in the number of agents than several alternative methods. In particular, 
for a sequential version of a firefighting task as described above, we were able to scale to 1000 agents, where 
previous approaches to Dec-POMDPs have not been demonstrated beyond 20 agents. 

The rest of this paper is organized as follows. Sec. 2 provides background by introducing collaborative
(Bayesian) games and their solution methods. Sec. 3 introduces CGBGs, which capture both agent and type
independence. This section also presents solution methods that exploit such independence, analyzes their computational
complexity, and empirically evaluates their performance. In Sec. 4, we show that the impact of our
work extends to sequential tasks by presenting and evaluating a new Dec-POMDP method based on CGBGs.
Sec. 5 discusses related work, Sec. 6 discusses possible directions for future work, and Sec. 7 concludes.

2 Background



In this section, we provide background on various game-theoretic models for collaborative decision making. 
We start with the well-known framework of strategic games and discuss their graphical counterparts, which 
allow for compact representations of problems with many agents. Next, we discuss Bayesian games, which 
take into account different private information available to each agent. These models provide a foundation for 
understanding collaborative graphical Bayesian games, the framework we propose in Section 3.



2.1 Strategic Games 



The strategic game (SG) framework (Osborne and Rubinstein, 1994) is probably the most studied of all game-theoretic
models. Strategic games are also called normal form games or matrix games, since two-agent games
can be represented by matrices. We first introduce the formal model and then discuss solution methods and 
compact representations. 



2.1.1 The Strategic Game Model 

In a strategic game, a set of agents participate in a one-shot interaction in which they each select an action. The 
outcome of the game is determined by the combination of selected actions, which leads to a payoff for each 
agent. 

Definition 2.1. A strategic game (SG) is a tuple $(\mathcal{D}, \mathcal{A}, (u_1, \dots, u_n))$, where

• $\mathcal{D} = \{1, \dots, n\}$ is the set of $n$ agents,

• $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions $a = (a_1, \dots, a_n)$,

• $u_i : \mathcal{A} \to \mathbb{R}$ is the payoff function of agent $i$.

This article focuses on collaborative decision making: settings in which the agents have the same goal, 
which is modeled by the fact that the payoffs the agents receive are identical. 

Definition 2.2. A collaborative strategic game (CSG) is a strategic game in which each agent has the same
payoff function: $\forall_{i,j} \forall_a \; u_i(a) = u_j(a)$.

In the collaborative case, we drop the subscript on the payoff function and simply write u. CSGs are also 
called identical payoff games or team games. 



2.1.2 Solution Concepts 

A solution to an SG is a description of what actions each agent should take. While many solution concepts have 
been proposed, one of central importance is the equilibrium introduced by Nash (1950).

Definition 2.3. A joint action $a = (a_1, \dots, a_i, \dots, a_n)$ is a Nash equilibrium (NE) if and only if

$u_i((a_1, \dots, a_i, \dots, a_n)) \geq u_i((a_1, \dots, a_i', \dots, a_n)), \quad \forall_{i \in \mathcal{D}}, \; \forall_{a_i' \in \mathcal{A}_i}$. (2.1)

Intuitively, an NE is a joint action such that no agent can improve its payoff by changing its own action. A
game may have zero, one or multiple NEs.¹ When there are multiple NEs, the concept of Pareto optimality can
help distinguish between them.

1 Nash proved that every finite game contains at least one NE if actions are allowed to be played with a particular probability, i.e., if
mixed strategies are allowed.

Definition 2.4. A joint action $a$ is Pareto optimal if there is no other joint action $a'$ that specifies at least the
same payoff for every agent and a higher payoff for at least one agent, i.e., there exists no $a'$ such that

$\forall_i \; u_i(a') \geq u_i(a) \;\wedge\; \exists_i \; u_i(a') > u_i(a)$. (2.2)

If there does exist an $a'$ such that (2.2) holds, then $a'$ Pareto dominates $a$.

Definition 2.5. A joint action $a$ is a Pareto-optimal Nash equilibrium (PONE) if and only if it is an NE and
there is no other a' such that a' is an NE and Pareto dominates a. 

Note that this definition does not require that a is Pareto optimal. On the contrary, there may exist an a' that 
dominates a but is not an NE. 
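
Definitions 2.3 to 2.5 can be checked mechanically on small games. The following Python fragment is a minimal sketch under our own assumptions: the function names and the two-agent coordination game with identical payoffs are hypothetical illustrations, not part of the formal model. It enumerates pure-strategy NEs via (2.1) and then keeps those that no other NE Pareto dominates.

```python
from itertools import product

def nash_equilibria(payoffs, action_sets):
    """Pure-strategy NEs; payoffs[i][a] is agent i's payoff for joint action a (a tuple)."""
    equilibria = []
    for a in product(*action_sets):
        stable = True
        for i, actions in enumerate(action_sets):
            for alt in actions:
                deviation = a[:i] + (alt,) + a[i + 1:]
                if payoffs[i][deviation] > payoffs[i][a]:
                    stable = False  # agent i gains by deviating unilaterally
        if stable:
            equilibria.append(a)
    return equilibria

def pareto_dominates(payoffs, a_prime, a):
    """True if a_prime is at least as good for all agents and strictly better for one."""
    n = len(payoffs)
    return (all(payoffs[i][a_prime] >= payoffs[i][a] for i in range(n))
            and any(payoffs[i][a_prime] > payoffs[i][a] for i in range(n)))

def pareto_optimal_nes(payoffs, action_sets):
    """NEs that are not Pareto dominated by another NE (Definition 2.5)."""
    nes = nash_equilibria(payoffs, action_sets)
    return [a for a in nes if not any(pareto_dominates(payoffs, b, a) for b in nes)]

# Hypothetical coordination game: both agents receive the same payoff u.
action_sets = [("x", "y"), ("x", "y")]
u = {("x", "x"): 2, ("x", "y"): 0, ("y", "x"): 0, ("y", "y"): 1}
payoffs = [u, u]

nes = nash_equilibria(payoffs, action_sets)       # two coordinated NEs
pones = pareto_optimal_nes(payoffs, action_sets)  # only the better one remains
```

In this toy game both coordinated joint actions are NEs, but only the one with the higher common payoff is a PONE.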



2.1.3 Solving CSGs 

In collaborative strategic games, each maximizing entry of the payoff function is a PONE. Therefore, finding 
a PONE requires only looping over all the entries in u and selecting a maximizing one, which takes time 
linear in the size of the game. However, coordination issues can arise when searching for a PONE with a 
decentralized algorithm, e.g., when there are multiple maxima. Ensuring that the agents select the same PONE
can be accomplished by imposing certain social conventions or through repeated interactions (Boutilier, 1996).
In this article, we assume that the game is solved in an off-line centralized planning phase and that the joint 
strategy is then distributed to the agents, who merely execute the actions in the on-line phase. We focus on the 
design of cooperative teams of agents, for which this is a reasonable assumption. 
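
The centralized procedure described above amounts to a single pass over the payoff function. As a minimal sketch (the function name `solve_csg` and the three-agent payoff function are hypothetical and chosen only for illustration), it can be written as follows; note that the number of entries visited is the product of the individual action set sizes.

```python
from itertools import product

def solve_csg(action_sets, u):
    """Return a maximizing joint action of the common payoff u, its payoff,
    and the number of payoff entries that were inspected."""
    joint_actions = list(product(*action_sets))
    best = max(joint_actions, key=u)
    return best, u(best), len(joint_actions)

def u(a):
    # Hypothetical common payoff: the agents are rewarded only for full coordination.
    if all(ai == "y" for ai in a):
        return 3
    if all(ai == "x" for ai in a):
        return 2
    return 0

best, value, n_entries = solve_csg([("x", "y")] * 3, u)
```

With three agents and two actions each, the loop inspects 2^3 = 8 entries; each additional agent doubles this number.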



2.2 Collaborative Graphical Strategic Games 

Although CSGs are conceptually easy to solve, the game description scales exponentially with the number of
agents. That is, the size of the payoff function and thus the time required for the trivial algorithm is $O(|\mathcal{A}_*|^n)$,
where $|\mathcal{A}_*|$ denotes the size of the largest individual action set. This is a major obstacle in the representation
and solution of SGs for large values of $n$. Many games, however, possess independence because not all agents
need to coordinate directly (Guestrin et al., 2002; Kearns et al., 2001; Kok and Vlassis, 2006). This idea is
formalized by collaborative graphical strategic games.



2.2.1 The Collaborative Graphical SG Model 

In collaborative graphical SGs, the payoff function is decomposed into local payoff functions, each having 
limited scope, i.e., only subsets of agents participate in each local payoff function. 

Definition 2.6. A collaborative graphical strategic game (CGSG) is a CSG whose payoff function $u$ decomposes
over a number $\rho$ of local payoff functions $\mathcal{U} = \{u^1, \dots, u^\rho\}$:

$u(a) = \sum_{e=1}^{\rho} u^e(a_e)$. (2.3)

Each local payoff function $u^e$ has scope $\mathbb{A}(u^e)$, the subset of agents that participate in $u^e$. Here $a_e$ denotes the
local joint action, i.e., the profile of actions of the agents in $\mathbb{A}(u^e)$.



Each local payoff component can be interpreted as a hyper-edge in an interaction hyper-graph $IG = (\mathcal{D}, \mathcal{E})$
in which the nodes $\mathcal{D}$ are agents and the hyper-edges $\mathcal{E}$ are local payoff functions (Nair et al., 2005;
Oliehoek et al., 2008c). Two (or more) agents are connected by such a (hyper-)edge $e \in \mathcal{E}$ if and only if they
participate in the corresponding local payoff function $u^e$. Note that we shall abuse notation in that $e$ is used
as an index into the set of local payoff functions and as an element of the set of scopes. Fig. 2a shows the
interaction hyper-graph of a five-agent CGSG. If only two agents participate in each local payoff function, the
interaction hyper-graph reduces to a regular graph and the framework is identical to that of coordination graphs
(Guestrin et al., 2002; Kok and Vlassis, 2006).



CGSGs are also similar to graphical games (Kearns et al., 2001; Kearns, 2007; Soni et al., 2007). However,
there is a crucial difference in the meaning of the term 'graphical'. In CGSGs, it indicates that the single,
common payoff function ($u = u_1 = \dots = u_n$) decomposes into local payoff functions, each involving subsets of
agents. However, all agents participate in the common payoff function (otherwise they would be irrelevant to the
game). In contrast, graphical games are typically not collaborative. Thus, in that context, the term indicates that
the individual payoff functions $u_1, \dots, u_n$ involve subsets of agents. However, these individual payoff functions
do not decompose into sums of local payoff functions.



2.2.2 Solving CGSGs 



Solving a collaborative graphical strategic game entails finding a maximizing joint action. However, if the
representation of a particular problem is compact, i.e., exponentially smaller than its non-graphical (i.e., CSG)
representation, then the trivial algorithm of Sec. 2.1.3, which enumerates all joint actions, runs in time
exponential in the size of the compact representation. Non-serial dynamic programming
(NDP) (Bertele and Brioschi, 1972), also known as variable elimination (Guestrin et al., 2002; Vlassis,
2007) and bucket elimination (Dechter, 1999), can find an optimal solution much faster by exploiting the structure
of the problem. We will explain NDP in more detail in Sec. 3.4.1.
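
To give a flavor of how such elimination exploits the decomposition (2.3), the following Python fragment is a minimal sketch of variable elimination for a CGSG. It is a generic textbook-style formulation under our own assumptions, not the implementation evaluated in this article: the data layout (scopes as tuples of agent indices, local payoffs as dictionaries), the function name `ndp`, and the three-agent chain used as an example are all hypothetical.

```python
from itertools import product

def ndp(action_sets, factors, order):
    """Maximize a sum of local payoffs by eliminating agents one at a time.

    action_sets: dict agent -> list of actions
    factors: list of (scope, table); scope is a tuple of agents and table maps
             local joint actions (tuples aligned with scope) to payoffs
    order: elimination order containing every agent exactly once
    """
    factors = list(factors)
    responses = []  # (agent, neighbors, best action for each neighbor context)
    for i in order:
        involved = [f for f in factors if i in f[0]]
        factors = [f for f in factors if i not in f[0]]
        neighbors = tuple(sorted({j for scope, _ in involved for j in scope if j != i}))
        new_table, response = {}, {}
        for context in product(*(action_sets[j] for j in neighbors)):
            assignment = dict(zip(neighbors, context))

            def local_sum(ai):
                # Sum of all local payoffs involving agent i, given the context.
                assignment[i] = ai
                return sum(t[tuple(assignment[j] for j in s)] for s, t in involved)

            best = max(action_sets[i], key=local_sum)
            new_table[context] = local_sum(best)
            response[context] = best
        factors.append((neighbors, new_table))  # payoff function induced on the neighbors
        responses.append((i, neighbors, response))
    value = sum(t[()] for _, t in factors)  # only empty-scope factors remain
    joint = {}
    for i, neighbors, response in reversed(responses):
        joint[i] = response[tuple(joint[j] for j in neighbors)]
    return value, joint

# Hypothetical chain 1 - 2 - 3 with two local payoff functions.
action_sets = {1: [0, 1], 2: [0, 1], 3: [0, 1]}
u12 = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 2}
u23 = {(0, 0): 3, (0, 1): 0, (1, 0): 0, (1, 1): 1}
value, joint = ndp(action_sets, [((1, 2), u12), ((2, 3), u23)], order=[1, 2, 3])
```

Each elimination step enumerates only the actions of the eliminated agent and its current neighbors, so the cost is governed by the largest neighborhood created during elimination rather than by the total number of agents.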

Alternatively, Max-Plus, a message-passing algorithm described further in Sec. 3.4.2, can be applied to
the interaction graph (Kok and Vlassis, 2005, 2006). In practice, Max-Plus is often much faster than NDP
(Kok and Vlassis, 2005, 2006; Farinelli et al., 2008; Kim et al., 2010). However, when more than two agents
participate in the same hyper-edge (i.e., when the interaction graph is a hyper-graph), message passing cannot
be conducted on the hyper-graph itself. Fortunately, an interaction hyper-graph can be translated into a factor
graph (Kschischang et al., 2001; Loeliger, 2004) to which Max-Plus is applicable. The resulting factor graph
is a bipartite graph containing one set of nodes for all the local payoff functions and another for all the agents.³
A local payoff function $u^e$ is connected to an agent $i$ if and only if $i \in \mathbb{A}(u^e)$. Fig. 2 illustrates the relationship
between an interaction hyper-graph and a factor graph.
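
The translation from hyper-graph to factor graph is purely structural. The sketch below illustrates it in Python; the five-agent scopes are a hypothetical example and are not claimed to match the structure drawn in Fig. 2, and the function name `factor_graph` is our own.

```python
def factor_graph(agents, scopes):
    """Build the bipartite factor graph of an interaction hyper-graph.

    scopes: dict mapping each local payoff function (hyper-edge) to the tuple
            of agents in its scope.
    Returns the list of (payoff function, agent) edges and, for every agent,
    the local payoff functions it participates in.
    """
    edges = [(e, i) for e, scope in scopes.items() for i in scope]
    functions_of_agent = {i: [e for e, scope in scopes.items() if i in scope]
                          for i in agents}
    return edges, functions_of_agent

# Hypothetical five-agent CGSG with three local payoff functions.
agents = [1, 2, 3, 4, 5]
scopes = {"u1": (1, 2), "u2": (2, 3, 4), "u3": (4, 5)}
edges, functions_of_agent = factor_graph(agents, scopes)
```

The hyper-edge `u2` with three agents becomes an ordinary factor node with three incident edges, which is why message passing is possible on the factor graph but not on the hyper-graph itself.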

It is also possible to convert a CGSG into a (non-collaborative) graphical SG by combining all payoff functions
in which an agent participates into one normalized, individual payoff function.⁴ Several methods for solving
graphical SGs are then applicable (Vickrey and Koller, 2002; Ortiz and Kearns, 2003; Daskalakis and Papadimitriou,
2006). Unfortunately, the individual payoff functions resulting from this transformation are exponentially larger
in the worst case.



2.3 Bayesian Games 

Although strategic games provide a rich model of interactions between agents, they assume that each agent
has complete knowledge of all relevant information and can therefore perfectly predict the payoffs that result
from each joint action. As such, they cannot explicitly represent cases where agents possess private information
that influences the effects of actions. For example, in the firefighting example depicted in Fig. 1, there is no
natural way in a strategic game to represent the fact that each agent has different information about the state
of the houses. In more complex problems with a large number of agents, modeling private information is even
more important, since assuming that so many agents have perfect knowledge of the complete state of a complex
environment is rarely realistic. In this section, we describe Bayesian games, which augment the strategic game
framework to explicitly model private information. As before, we focus on the collaborative case.

(a) An interaction hyper-graph.    (b) The corresponding factor graph.

Figure 2: A CGSG with five agents. In (a), each node is an agent and each hyper-edge is a local payoff function.
In (b), the circular nodes are agents and the square nodes are local payoff functions, with edges indicating in
which local payoff function each agent participates.

2 This constitutes an edge-based decomposition, which stands in contrast to agent-based decompositions (Kok and Vlassis, 2006). We
focus on edge-based decompositions because they are more general.

3 In the terminology of factor graphs, the local payoff functions correspond to factors and the agents to variables whose domains are the
agents' actions.

4 This corresponds to converting an edge-based representation to an agent-based representation (Kok and Vlassis, 2006).

2.3.1 The Bayesian Game Model 



A Bayesian game, also called a strategic game of imperfect information (Osborne and Rubinstein, 1994), is an
augmented strategic game in which the players hold private information. The private information of agent $i$
defines its type $\theta_i \in \Theta_i$. The payoffs the agents receive depend not only on their actions, but also on their types.
Formally, a Bayesian game is defined as follows:

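Definition 2.7. A Bayesian game (BG) is a tuple $(\mathcal{D}, \mathcal{A}, \Theta, \Pr(\Theta), (u_1, \dots, u_n))$, where

• $\mathcal{D}, \mathcal{A}$ are the sets of agents and joint actions as in an SG,

• $\Theta = \times_{i \in \mathcal{D}} \Theta_i$ is the set of joint types $\theta = (\theta_1, \dots, \theta_n)$,

• $\Pr(\Theta)$ is the distribution over joint types, and

• $u_i : \Theta \times \mathcal{A} \to \mathbb{R}$ is the payoff function of agent $i$.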

In many problems, the types are a probabilistic function of a hidden state, i.e., based on a hidden state,
there is some probability $\Pr(\theta)$ for each joint type. This is typically the case, as in the example below, when an
agent's type corresponds to a private observation it makes about such a state. However, this hidden state is not
a necessary component of a BG. On the contrary, BGs can also model problems where the types correspond to
intrinsic properties of the agents. For instance, in an employee recruitment game, a potential employee's type
could correspond to whether or not he or she is a hard worker.

Definition 2.8. A collaborative Bayesian game (CBG) is a Bayesian game with identical payoffs:

$\forall_{i,j} \forall_\theta \forall_a \; u_i(\theta, a) = u_j(\theta, a)$.

In a strategic game, the agents simply select actions. However, in a BG, the agents can condition their
actions on their types. Consequently, agents in BGs select policies instead of actions. A joint policy
$\beta = (\beta_1, \dots, \beta_n)$ consists of individual policies $\beta_i$ for each agent $i$. Deterministic (pure) individual policies are
mappings from types to actions, $\beta_i : \Theta_i \to \mathcal{A}_i$, while stochastic policies map each type $\theta_i$ to a probability distribution
over actions $\Pr(\mathcal{A}_i)$.
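
Since a deterministic individual policy assigns one action to every type, an agent with $|\Theta_i|$ types and $|\mathcal{A}_i|$ actions has $|\mathcal{A}_i|^{|\Theta_i|}$ such policies. The following Python fragment is a small illustration of this count; the function name is hypothetical, and the type and action labels anticipate agent 1 of the two-agent firefighting example of this section.

```python
from itertools import product

def deterministic_policies(types, actions):
    """All mappings from an agent's types to its actions, each as a dict."""
    return [dict(zip(types, choice)) for choice in product(actions, repeat=len(types))]

# Agent 1 of the firefighting example: two types (flames / no flames), two actions.
policies_1 = deterministic_policies(["F1", "N1"], ["H1", "H2"])
n_individual = len(policies_1)   # 2 actions ** 2 types
n_joint = n_individual ** 2      # two agents with the same numbers of types and actions
```

The number of joint policies is the product of the individual counts, which is the source of the exponential blow-up discussed in Sec. 2.3.3.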

                                  $\Pr(\theta|s)$
state s                 Pr(s)     $(F_1,F_2)$   $(F_1,N_2)$   $(N_1,F_2)$   $(N_1,N_2)$
no neighbors on fire    0.7       0.01          0.09          0.09          0.81
house 1 on fire         0.10      0.09          0.81          0.01          0.09
house 3 on fire         0.15      0.09          0.01          0.81          0.09
both on fire            0.05      0.81          0.09          0.09          0.01

$\Pr(\theta)$                     0.07          0.15          0.19          0.59

Table 1: The conditional probabilities of the joint types given states and the resulting distribution over joint
types for the two-agent firefighting problem.





                        Payoff of joint actions $u(s,a)$
state s                 $(H_1,H_2)$   $(H_1,H_3)$   $(H_2,H_2)$   $(H_2,H_3)$
no neighbors on fire    +2            0             +3            +2
house 1 on fire         +4            +2            +3            +2
house 3 on fire         +2            +2            +3            +4
both on fire            +4            +4            +3            +4

Table 2: The payoffs as a function of the joint actions and hidden state for the two-agent firefighting problem.

Example: Two-Agent Fire Fighting

As an example, consider a formal model of the situation depicted in Fig. 1. The agents each have two actions
available: agent 1 can fight fire at the first two houses ($H_1$ and $H_2$) and agent 2 can fight fire at the last two
houses ($H_2$ and $H_3$). Both agents are located near $H_2$ and therefore know whether it is burning. However, they
are uncertain whether $H_1$ and $H_3$ are burning or not. Each agent gets a noisy observation of one of these houses,
which defines its type. In particular, agent 1 can observe flames ($F_1$) or not ($N_1$) at $H_1$ and agent 2 can observe
flames ($F_2$) or not ($N_2$) at $H_3$. The probability of making the correct observation is 0.9. Table 1 shows the resulting
probabilities of joint types conditional on the state. The table also shows the a priori state distribution (it is
most likely that none of the neighboring houses are on fire, and $H_3$ has a slightly higher probability of being on
fire than $H_1$) and the resulting probability distribution over joint types, computed by marginalizing over states:
$\Pr(\theta) = \sum_s \Pr(\theta|s)\Pr(s)$. Finally, each agent generates a +2 payoff for the team by fighting fire at a burning
house. However, payoffs are sub-additive: if both agents fight fire at the same house (i.e., at $H_2$), a payoff of +3
is generated. Fighting fire at a house that is not burning does not generate any payoff. Table 2 summarizes all
the possible payoffs.

These rewards can be converted to the $u(\theta,a)$ format by computing the conditional state probabilities
$\Pr(s|\theta)$ using Bayes' rule and taking the expectation over states:

$u(\theta,a) = \sum_s u(s,a) \cdot \Pr(s|\theta)$. (2.4)

The result is a fully specified Bayesian game whose payoff matrix is shown in Table 3.



                              $\theta_2 = F_2$          $\theta_2 = N_2$
$\theta_1$    action          $H_2$       $H_3$         $H_2$       $H_3$
$F_1$         $H_1$           3.414       2.957         3.14        1.22
              $H_2$           3           3.543         3           2.08
$N_1$         $H_1$           2.058       1.384         2.032       0.079
              $H_2$           3           3.326         3           2.047

Table 3: The Bayesian game payoff matrix for the two-agent firefighting problem.
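
The conversion from the state-based description to the Bayesian game can be reproduced in a few lines. The Python fragment below is a minimal sketch under stated assumptions: the state labels and function names are hypothetical, and the observation noise is assumed to be independent across the two agents, which is consistent with the entries of Table 1 (e.g., 0.81 = 0.9 · 0.9). It computes $\Pr(\theta)$ by marginalizing over states and $u(\theta,a)$ via (2.4).

```python
# Prior over hidden states, Pr(s), taken from Table 1 (state labels are our own).
STATES = {"none": 0.70, "house1": 0.10, "house3": 0.15, "both": 0.05}
JOINT_TYPES = [("F1", "F2"), ("F1", "N2"), ("N1", "F2"), ("N1", "N2")]
JOINT_ACTIONS = [("H1", "H2"), ("H1", "H3"), ("H2", "H2"), ("H2", "H3")]

def pr_types_given_state(theta, s, p_correct=0.9):
    """Pr(theta | s), assuming each agent observes its house correctly w.p. 0.9."""
    fire = (s in ("house1", "both"), s in ("house3", "both"))
    flames = (theta[0] == "F1", theta[1] == "F2")
    p = 1.0
    for saw, burning in zip(flames, fire):
        p *= p_correct if saw == burning else 1.0 - p_correct
    return p

def u_state(s, a):
    """Payoff u(s, a) of Table 2: +2 per agent at a burning house, +3 if both at H2."""
    if a == ("H2", "H2"):
        return 3.0
    burning = {"H1": s in ("house1", "both"), "H2": True, "H3": s in ("house3", "both")}
    return sum(2.0 for house in a if burning[house])

# Pr(theta) = sum_s Pr(theta | s) Pr(s)
pr_theta = {th: sum(pr_types_given_state(th, s) * p_s for s, p_s in STATES.items())
            for th in JOINT_TYPES}

def u_bg(theta, a):
    """u(theta, a) = sum_s u(s, a) Pr(s | theta), with Pr(s | theta) from Bayes' rule."""
    return sum(u_state(s, a) * pr_types_given_state(theta, s) * p_s
               for s, p_s in STATES.items()) / pr_theta[theta]

payoff_matrix = {(th, a): u_bg(th, a) for th in JOINT_TYPES for a in JOINT_ACTIONS}
```

Under these assumptions the sketch yields $\Pr(F_1,F_2) = 0.07$ and, for instance, $u((F_1,F_2),(H_1,H_2)) \approx 3.414$ and $u((F_1,N_2),(H_1,H_3)) = 1.22$.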



2.3.2 Solution Concepts 

In a BG, the concept of NE is replaced by a Bayesian Nash equilibrium (BNE). A profile of policies
$\beta = (\beta_1, \dots, \beta_n)$ is a BNE when no agent $i$ has an incentive to switch its policy $\beta_i$, given the policies of the other
agents $\beta_{\neq i}$. This occurs when, for each agent $i$ and each of its types $\theta_i$, $\beta_i$ specifies the action that maximizes
its expected value. When a Bayesian game is collaborative, the characterization of a BNE is simpler. Let the
value of a joint policy be its expected payoff:

$V(\beta) = \sum_{\theta \in \Theta} \Pr(\theta) \, u(\theta, \beta(\theta))$, (2.5)

where $\beta(\theta) = (\beta_1(\theta_1), \dots, \beta_n(\theta_n))$ is the joint action specified by $\beta$ for joint type $\theta$. Furthermore, let the
contribution of a joint type be:

$C_\theta(a) = \Pr(\theta) \, u(\theta, a)$. (2.6)

The value of a joint policy $\beta$ can be interpreted as a sum of contributions, one for each joint type. The BNE of
a CBG maximizes a sum of such contributions.

Theorem 2.1. The Bayesian Nash equilibrium of a CBG is:

$\beta^* = \arg\max_\beta V(\beta) = \arg\max_\beta \sum_{\theta \in \Theta} C_\theta(\beta(\theta))$, (2.7)

which is a Pareto-optimal (Bayesian) Nash equilibrium (PONE).

Proof. A CBG $G$ can be reduced to a CSG $G'$ where each action of $G'$ corresponds to a policy of $G$. Furthermore,
in $G'$, a joint action $a'$ corresponds to a joint policy of $G$ and the payoff of a joint action $u'(a')$ corresponds
to the value of the joint policy. As explained in Sec. 2.1.3, a PONE for a CSG is a maximizing entry, which
corresponds to (2.7). For a more formal proof, see (Oliehoek et al., 2008b). □



2.3.3 Solving CBGs 

Although the characterization of a PONE is simple, finding one is intractable in general. In fact, a CBG is
equivalent to a team decision problem, which is NP-hard (Tsitsiklis and Athans, 1985).

Since a CBG is an instance of a (non-collaborative) BG, solution methods for the latter apply. A common
approach is to convert a BG $G$ to an SG $G'$, as in the proof of Theorem 2.1. An action $a_i'$ in $G'$ corresponds to a
policy $\beta_i$ in $G$, $a_i' = \beta_i$, and the payoff of a joint action in $G'$ equals the expected payoff of the corresponding
joint BG policy, $u'(a') = V(\beta)$. However, since the number of policies for an agent in a BG is exponential
in the number of types, the conversion to an SG leads to an exponential blowup in size. When applying this
procedure in the collaborative case (i.e., to a CBG), the result is a CSG to which the trivial algorithm applies. In
effect, since joint actions correspond to joint BG-policies, this procedure corresponds to brute-force evaluation
of all joint BG-policies.
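
For the two-agent firefighting problem, this brute-force evaluation is small enough to write out. The Python fragment below is a minimal sketch under the same assumptions as the model description (independent observation noise with accuracy 0.9; state labels and function names are hypothetical). To avoid depending on rounded table entries, it computes each contribution directly from the state-based model, using $C_\theta(a) = \Pr(\theta) u(\theta,a) = \sum_s \Pr(s) \Pr(\theta|s) u(s,a)$.

```python
from itertools import product

PR_STATE = {"none": 0.70, "house1": 0.10, "house3": 0.15, "both": 0.05}
TYPES = [("F1", "N1"), ("F2", "N2")]      # individual types of agents 1 and 2
ACTIONS = [("H1", "H2"), ("H2", "H3")]    # individual actions of agents 1 and 2

def pr_types_given_state(theta, s, p_correct=0.9):
    fire = (s in ("house1", "both"), s in ("house3", "both"))
    flames = (theta[0] == "F1", theta[1] == "F2")
    p = 1.0
    for saw, burning in zip(flames, fire):
        p *= p_correct if saw == burning else 1.0 - p_correct
    return p

def u_state(s, a):
    if a == ("H2", "H2"):
        return 3.0  # sub-additive payoff when both agents fight fire at house 2
    burning = {"H1": s in ("house1", "both"), "H2": True, "H3": s in ("house3", "both")}
    return sum(2.0 for house in a if burning[house])

def contribution(theta, a):
    """C_theta(a) = Pr(theta) u(theta, a), computed as sum_s Pr(s) Pr(theta|s) u(s, a)."""
    return sum(p_s * pr_types_given_state(theta, s) * u_state(s, a)
               for s, p_s in PR_STATE.items())

def value(joint_policy):
    """V(beta): sum of contributions over all joint types, as in (2.5)."""
    return sum(contribution(theta, tuple(pol[t] for pol, t in zip(joint_policy, theta)))
               for theta in product(*TYPES))

def individual_policies(types, actions):
    return [dict(zip(types, choice)) for choice in product(actions, repeat=len(types))]

# Enumerate all 4 x 4 = 16 joint policies and keep a maximizing one, as in (2.7).
joint_policies = list(product(*(individual_policies(t, a) for t, a in zip(TYPES, ACTIONS))))
best = max(joint_policies, key=value)
best_value = value(best)
```

In this instance the maximizing joint policy has agent 1 always fight fire at house 2, while agent 2 moves to house 3 only when it observes flames there; its value is 3.1.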

A different approach to solving CBGs is alternating maximization (AM). Starting with a random joint policy,
each agent iteratively computes a best response policy for each of its types. In this way the agents hill-climb
towards a local optimum. While the method guarantees finding an NE, it cannot guarantee finding a PONE and
there is no bound on the quality of the approximation. By starting from a specially constructed starting point, it
is possible to give some guarantees on the quality of approximation (Cogill and Lall, 2006). These guarantees,
however, degrade exponentially as the number of agents increases.
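
The hill-climbing character of AM can be illustrated on the firefighting problem. The following Python fragment is a minimal sketch, not a reference implementation: function names and state labels are hypothetical, the observation noise is assumed independent with accuracy 0.9, and a fixed rather than random starting policy is used so that the run is reproducible. Each agent in turn replaces its policy by a best response, computed separately for each of its types, and the loop stops when no agent can improve the value.

```python
from itertools import product

PR_STATE = {"none": 0.70, "house1": 0.10, "house3": 0.15, "both": 0.05}
TYPES = [("F1", "N1"), ("F2", "N2")]
ACTIONS = [("H1", "H2"), ("H2", "H3")]
JOINT_TYPES = list(product(*TYPES))

def pr_types_given_state(theta, s, p_correct=0.9):
    fire = (s in ("house1", "both"), s in ("house3", "both"))
    flames = (theta[0] == "F1", theta[1] == "F2")
    p = 1.0
    for saw, burning in zip(flames, fire):
        p *= p_correct if saw == burning else 1.0 - p_correct
    return p

def u_state(s, a):
    if a == ("H2", "H2"):
        return 3.0
    burning = {"H1": s in ("house1", "both"), "H2": True, "H3": s in ("house3", "both")}
    return sum(2.0 for house in a if burning[house])

def contribution(theta, a):
    return sum(p_s * pr_types_given_state(theta, s) * u_state(s, a)
               for s, p_s in PR_STATE.items())

def value(policy):
    return sum(contribution(th, tuple(p[t] for p, t in zip(policy, th)))
               for th in JOINT_TYPES)

def best_response(policy, i):
    """Best action for each type of agent i, holding the other agents' policies fixed."""
    response = {}
    for t in TYPES[i]:
        def gain(ai):
            # Only joint types in which agent i has type t are affected by this choice.
            total = 0.0
            for th in JOINT_TYPES:
                if th[i] == t:
                    a = [p[x] for p, x in zip(policy, th)]
                    a[i] = ai
                    total += contribution(th, tuple(a))
            return total
        response[t] = max(ACTIONS[i], key=gain)
    return response

def alternating_maximization(policy, max_sweeps=50):
    policy = [dict(p) for p in policy]
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(policy)):
            trial = policy[:i] + [best_response(policy, i)] + policy[i + 1:]
            if value(trial) > value(policy) + 1e-12:
                policy, improved = trial, True
        if not improved:
            break  # no agent can improve unilaterally: a (possibly local) optimum
    return policy

start = [{"F1": "H1", "N1": "H1"}, {"F2": "H3", "N2": "H3"}]
result = alternating_maximization(start)
```

The returned joint policy is guaranteed only to be stable against unilateral changes; in larger games, different starting points may lead to different local optima.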

Finally, recent work shows that the additive structure of the value function (2.7) can be exploited by heuristic
search to greatly speed up the computation of optimal solutions (Oliehoek et al., 2010). Furthermore, the point-based
backup operation in a Dec-POMDP, which can be interpreted as a special case of CBG, can be solved
using state-of-the-art weighted constraint satisfaction methods (Kumar and Zilberstein, 2010), also providing
significant increases in performance.



3 Exploiting Independence in Collaborative Bayesian Games 

The primary goal of this work is to find ways to efficiently solve large CBGs, i.e., CBGs with many agents, 
actions and types. None of the models presented in the previous section are adequate for the task. CGSGs, 
by representing independence between agents, allow solution methods that scale to many agents, but they do 
not model private information. In contrast, CBGs model private information but do not represent independence 
between agents. Consequently, CBG solution methods scale poorly with respect to the number of agents. 

In this section, we propose a new model to address these issues. In particular, we make three main contributions.
First, Sec. 3.1 distinguishes between two types of independence that can occur in CBGs: in addition to
the agent independence that occurs in CGSGs, all CBGs possess type independence, an inherent consequence
of imperfect information. Second, Sec. 3.2 proposes collaborative graphical Bayesian games, a new framework
that models both these types of independence. Third, Sec. 3.4 describes solution methods for this model that use
a novel factor graph representation to capture both agent and type independence such that they can be exploited
by NDP and Max-Plus. We prove that, while the computational cost of NDP applied to such a factor graph
remains exponential in the number of individual types, Max-Plus is tractable for small local neighborhoods.



3.1 Agent and Type Independence 



As explained in Sec. 2.2, in many CSGs, agent interactions are sparse. The resulting independence, which we
call agent independence, has long been exploited to compactly represent and more efficiently solve games with 
many agents, as in the CGSG model. 

While many CBGs also possess agent independence, the CBG framework provides no way to model or 
exploit it. In addition, regardless of whether they have agent independence, all CBGs possess a second kind 
of independence, which we call type independence, that is an inherent consequence of imperfect information. 
Unlike agent independence, type independence is captured in the CBG model and can thus be exploited.

Type independence, which we originally identified in (Oliehoek, 2010), is a result of the additive structure of
a joint policy's value (shown in (2.7)). The key insight is that each of the contribution terms from (2.6) depends
only on the action selected for some individual types. In particular, the action $\beta_i(\theta_i)$ selected for type $\theta_i$ of
agent $i$ affects only the contribution terms whose joint types involve $\theta_i$.

For instance, in the two-agent firefighting problem, one possible joint type is $\theta = \langle N, N \rangle$ (neither agent
observes flames). Clearly, the action $\beta_1(F)$ that agent 1 selects when it has type $F$ (it observes flames) has no
effect on the contribution of this joint type.
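
The firefighting example above can be checked numerically. The following is a minimal sketch, not taken from the original work: the action labels `H1`/`H2`, the uniform type distribution, and the payoff numbers are all hypothetical placeholders chosen only to make the independence visible.

```python
from itertools import product

TYPES = ["F", "N"]        # F: flames observed, N: no flames observed
ACTIONS = ["H1", "H2"]    # hypothetical action labels (not from the paper)

# Hypothetical uniform joint type distribution and arbitrary payoffs.
prob = {theta: 0.25 for theta in product(TYPES, repeat=2)}
payoff = {
    (theta, a): float(len(set(a)) + (theta[0] == "F") * (a[0] == "H1"))
    for theta in prob
    for a in product(ACTIONS, repeat=2)
}

def contribution(theta, policy1, policy2):
    """Contribution of one joint type: Pr(theta) * u(theta, beta(theta))."""
    joint_action = (policy1[theta[0]], policy2[theta[1]])
    return prob[theta] * payoff[(theta, joint_action)]

def value(policy1, policy2):
    """Value of a joint policy: the sum of all contributions."""
    return sum(contribution(theta, policy1, policy2) for theta in prob)

beta1 = {"F": "H1", "N": "H2"}
beta2 = {"F": "H2", "N": "H2"}
beta1_changed = {"F": "H2", "N": "H2"}   # only the action for type F differs

# The contribution of joint type (N, N) ignores agent 1's action for type F.
unaffected = (contribution(("N", "N"), beta1, beta2)
              == contribution(("N", "N"), beta1_changed, beta2))
```

Only the contributions of joint types in which agent 1 has type F change between `beta1` and `beta1_changed`; all others are identical, which is exactly the structure a factor graph over individual types can exploit.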

Figure 3: A factor graph of the two-agent fire fighting problem, illustrating the type independence inherent in
CBGs. The action chosen for an individual type $\theta_i$ affects only a subset of contribution factors. For instance,
the action that agent 1 selects when it has type $N$ affects only the contribution factors $C_{\langle N,F \rangle}$ and $C_{\langle N,N \rangle}$ in
which it has that type.



As illustrated in Fig. 3, this type of structure can also be represented by a factor graph with one set of nodes
for all the contributions (corresponding to joint types) and another set for all the individual types of all the
agents. Unlike the representation that results from reducing a BG to an SG played by agent-type combinations
(Osborne and Rubinstein, 1994), this factor graph does not completely 'flatten' the utility function. On the
contrary, it explicitly represents the contributions of each joint type, thereby capturing type independence. The
distinction between agent and type independence is summarized in the following observation.

Observation 1. CBGs can possess two different types of independence: 

1. Agent independence: the payoff function is additively decomposed over local payoff functions, each 
specified over only a subset of agents, as in CGSGs. 

2. Type independence: only one type per agent is actually realized, leading to a value function that is 
additively decomposed over contributions, each specified over only a subset of types. 

The consequence of this distinction is that neither the CGSG nor CBG model is adequate to model complex 
games with imperfect information. To scale to many agents, we need a new model that expresses (and therefore 
makes it possible to exploit) both types of independence. In the rest of this section, we propose a model that 
does this and show how both types of independence can be represented in a factor graph which, in turn, can be 
solved using NDP and Max-Plus. 



3.2 The Collaborative Graphical Bayesian Game Model 

A collaborative graphical Bayesian game is a CBG whose common payoff function decomposes over a number 
of local payoff functions (as in a CGSG). 

Definition 3.1. A collaborative graphical Bayesian game (CGBG) is a tuple $\langle \mathcal{D}, \mathcal{A}, \Theta, \mathcal{P}, \mathcal{U} \rangle$, with:

• $\mathcal{D}, \mathcal{A}, \Theta$ as in a Bayesian game,

• $\mathcal{P} = \{\Pr(\Theta_1), \ldots, \Pr(\Theta_\rho)\}$ is a set of consistent local probability distributions,

• $\mathcal{U} = \{u^1, \ldots, u^\rho\}$ is the set of $\rho$ local payoff functions. These correspond to a set $\mathcal{E}$ of hyper-edges of
an interaction graph, such that the total team payoff can (with some abuse of notation) be written as
$u(\theta, a) = \sum_{e \in \mathcal{E}} u^e(\theta_e, a_e)$.

A CGBG is collaborative because all agents share the common payoff function $u(\theta, a)$. It is also graphical
because this payoff function decomposes into a sum of local payoff functions, each of which depends on only
a subset of agents.⁵ As in CGSGs, each local payoff function $u^e$ has scope $\mathbb{A}(u^e)$, which can be expressed in an
interaction hyper-graph $IG = \langle \mathcal{D}, \mathcal{E} \rangle$ with one hyper-edge for each $e \in \mathcal{E}$. Strictly speaking, an edge corresponds
to the scope of a local payoff function, i.e., the set of agents that participate in it (as in $a_e$), but we will also
use $e$ to index the sets of hyper-edges and payoff functions (as in $u^e$).

Each local payoff function depends not only on the local joint action $a_e$, but also on the local joint type $\theta_e$,
i.e., the types of the agents in $e$ (i.e., in $\mathbb{A}(u^e)$). Furthermore, each local probability function $\Pr(\theta_e)$ specifies
the probability of each local joint type. The goal is to maximize the expected sum of rewards:

$\beta^* = \arg\max_{\beta} \sum_{\theta} \Pr(\theta) \, u(\theta, \beta(\theta)) = \arg\max_{\beta} \sum_{e \in \mathcal{E}} \sum_{\theta_e} \Pr(\theta_e) \, u^e(\theta_e, \beta_e(\theta_e))$ (3.1)

where $\beta_e(\theta_e)$ is the local joint action under policy $\beta$ given local joint type $\theta_e$.
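
To make the objective in (3.1) concrete, here is a small sketch that evaluates the decomposed sum for a joint policy and finds the maximizer by brute force. Everything in it is an illustrative assumption rather than the authors' code: the three-agent instance, the edge names, the uniform local probabilities, and the payoff function are made up for the example.

```python
from itertools import product

# Hypothetical CGBG: 3 agents (0, 1, 2) and two hyper-edges.
types = ["x", "y"]
actions = [0, 1]
edges = {"e1": (0, 1), "e2": (1, 2)}

# Local probabilities Pr(theta_e); uniform, hence trivially consistent.
local_prob = {e: {th: 0.25 for th in product(types, repeat=2)} for e in edges}

def local_payoff(e, theta_e, a_e):
    """Arbitrary illustrative local payoff u^e(theta_e, a_e)."""
    bonus = 1.0 if theta_e[0] == theta_e[1] else 0.0
    return float(a_e[0] == a_e[1]) + bonus * a_e[0]

def expected_value(policy):
    """Decomposed objective: sum over edges and local joint types."""
    total = 0.0
    for e, scope in edges.items():
        for theta_e, p in local_prob[e].items():
            a_e = tuple(policy[i][t] for i, t in zip(scope, theta_e))
            total += p * local_payoff(e, theta_e, a_e)
    return total

def brute_force():
    """Enumerate all joint policies (exponential; for illustration only)."""
    individual = [dict(zip(types, acts))
                  for acts in product(actions, repeat=len(types))]
    best = max(product(individual, repeat=3), key=expected_value)
    return best, expected_value(best)

best_policy, best_value = brute_force()
```

The enumeration visits every joint policy, so its cost grows exponentially with the number of agents and types; it serves only as a reference point for the structured solution methods discussed later in the section.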

In principle, the local probability functions can be computed from the full joint probability function $\Pr(\Theta)$.
However, doing so is generally intractable as it requires marginalizing over the types that are not in scope. By
including $\mathcal{P}$ in the model, we implicitly assume that $\Pr(\Theta)$ has a compact representation that allows for efficient
computation of $\Pr(\theta_e)$, e.g., by means of Bayesian networks (Pearl, 1988; Bishop, 2006) or other graphical
models.

Not all probability distributions over joint types will admit such a compact representation. However, those 
that do not will have a size exponential in the number of agents and thus cannot even be represented, much less 
solved, efficiently. Thus, the assumption that these local probability functions exist is minimal in the sense that 
it is a necessary condition for solving the game efficiently. Note, however, that it is not a sufficient condition. 
On the contrary, the computational advantage of the methods proposed below results from the agent and type 
independence captured in the resulting factor graph, not the existence of local probability functions. 



Example: Generalized Fire Fighting 

As an example, consider GENERALIZED FIRE FIGHTING, which is like the two-agent firefighting problem of
Sec. 2.3 but with $n$ agents. In this version there are $N_H$ houses and the agents are physically distributed over
the area. Each agent gets an observation of the $N_O$ nearest houses and may choose to fight fire at any of the
$N_A$ nearest houses. For each house $H$, there is a local payoff function involving the agents in its neighborhood
(i.e., the agents that can choose to fight fire at $H$). These payoff functions yield sub-additive rewards similar
to those in Table 2. The type of each agent $i$ is defined by the $N_O$ observations it receives from the surrounding
houses: $\theta_i \in \{F_i, N_i\}^{N_O}$. The probability of each type depends on the probability that the surrounding houses
are burning. As long as those probabilities can be compactly represented, the probabilities over types can be
too. Fig. 4 illustrates the case where $N_H = 4$ and $n = 3$. Each agent can go to the $N_A = 2$ closest houses. In this
problem, there are 4 local payoff functions, one for each house, each with limited scope. Note that the payoff
functions for the first and the last house include only one agent, which means their scopes are proper subsets
of the scopes of other payoff functions (those for houses 2 and 3 respectively). Therefore, they can be included
in those functions, reducing the number of local payoff functions in this example to two: one in which agents 1
and 2 participate, and one in which agents 2 and 3 participate.
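
The scope reduction in this example can be reproduced with a few lines of code. The sketch below assumes a specific layout that the text does not spell out: agent i is taken to stand between houses i and i+1 (1-based), so that its two closest houses are exactly those two. The function names are hypothetical.

```python
def house_scopes(n_agents, n_houses):
    """Scope of each house's payoff function: agents able to fight fire there.

    Assumed layout: agent i stands between houses i and i+1, so it can
    reach exactly those two houses (the case of two reachable houses).
    """
    return {
        h: frozenset(i for i in range(1, n_agents + 1) if h in (i, i + 1))
        for h in range(1, n_houses + 1)
    }

def absorb_subsets(scopes):
    """Keep only scopes that are not proper subsets of another scope."""
    unique = set(scopes.values())
    kept = [s for s in unique if not any(s < t for t in unique)]
    return sorted(sorted(s) for s in kept)

scopes = house_scopes(n_agents=3, n_houses=4)
merged = absorb_subsets(scopes)   # scopes of the remaining payoff functions
```

Under this layout, houses 1 and 4 have single-agent scopes that are absorbed by the scopes of houses 2 and 3, leaving two local payoff functions over agents {1, 2} and {2, 3}.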



3.3 Relationship to Other Models 



To provide a better understanding of the CGBG model, we elaborate on its relationship with existing models.

Just as CGSGs are related to graphical games, CGBGs are related to graphical BGs (Soni et al., 2007). However,



5 Arguably, since all CBGs have type independence, they are in some sense already graphical, as illustrated in Fig. 5. However, to be
consistent with the literature, we use the term graphical here to indicate agent independence.








Figure 4: Illustration of GENERALIZED FIRE FIGHTING with $N_H = 4$ and $n = 3$.



as before, there is a crucial difference in the meaning of the term 'graphical'. In CGBGs, all agents participate 
in a common payoff function that decomposes into local payoff functions, each involving subsets of agents. In 
contrast, graphical BGs are not necessarily collaborative and the individual payoff functions involve subsets of 
agents. Since these individual payoff functions do not decompose, CGBGs are not a special case of GBGs but
rather a unique, novel formalism. In addition, the graphical BGs considered by Soni et al. (2007) make much
more restrictive assumptions on the type probabilities, allowing only independent type distributions (i.e., $\Pr(\Theta)$
is defined as the product of individual type probabilities $\Pr(\theta_i)$) and assuming conditional utility independence
(i.e., the payoff of an agent $i$ depends only on its own type, not that of other agents: $u_i(\theta_i, a)$).

More closely related is the multiagent influence diagram (MAID) framework that extends decision diagrams
to multiagent settings (Koller and Milch, 2003). In particular, a MAID represents a decision problem with a
Bayesian network that contains a set of chance nodes and, for each agent, a set of decision and utility nodes. As 
in a CGBG, the individual payoff function for each agent is defined as the sum of local payoffs (one for each 
utility node of that agent). On the one hand, MAIDs are more general than CGBGs because they can represent 
non-identical payoff settings (though it would be straightforward to extend CGBGs to such problems). On 
the other hand, CGBGs are more general than MAIDs since they allow any representation of the distribution 
over joint types (e.g., a Markov random field), as long as the local probability distributions can be computed 
efficiently. 

When the local probabilities $\mathcal{P}$ are explicitly computed, a CGBG can be represented as a MAID, as illustrated
in Fig. 5. In this MAID, both utility nodes are associated with all agents, such that each agent's goal is to
optimize the sum $u^1 + u^2$ and the MAID is collaborative. The probability of an individual type is a deterministic
function of all the incoming local joint types, e.g., the probability of a particular value of $\theta_2$ is 1 if (and only if)
both $\theta_{12}$ and $\theta_{23}$ specify that value.

However, the resulting MAID's relevance graph (Koller and Milch, 2003), which indicates which decisions
influence each other, consists of a single strongly connected component. Consequently, the divide and conquer
solution method proposed by Koller and Milch offers no speedup over brute-force evaluation of all the joint
policies. In the following section, we propose methods to overcome this problem and solve CGBGs efficiently.



3.4 Solution Methods 

Solving a CGBG amounts to finding the maximizing joint policy as expressed by (3.1). As mentioned in
Sec. 2.3.3, it is possible to convert a BG to an SG in which the actions correspond to BG policies. In previous
work (Oliehoek et al., 2008c), we applied similar transformations to CGBGs, yielding CGSGs to which
all the solution methods mentioned in Sec. 2.2 are applicable. Alternatively, it is possible to convert to a
(non-collaborative) graphical BG (Soni et al., 2007) and apply the proposed solution method. Under the hood,
however, this method converts to a graphical SG.

The primary limitation of all of the options mentioned above is that they exploit only agent independence, 
not type independence. In fact, converting a CGBG to a CGSG has the effect of stripping all type independence 
from the model. To see why, note that type independence in a CBG occurs as a result of the form of the payoff 
function $u(\theta, \beta(\theta))$. In other words, the payoff depends on the joint action, which in turn depends only on the
joint type that occurs. Converting to a CSG produces a payoff function that depends on the joint action selected 
for all possible joint types, effectively ignoring type independence. A direct result is that the solution methods 
have an exponential dependence on the number of types. 

In this section, we propose a new approach to solving CGBGs that avoids this problem by exploiting both 
kinds of independence. The main idea is to represent the CGBG using a novel factor graph formulation that 
neatly captures both agent and type independence. The resulting factor graph can then be solved using methods 
such as NDP and Max-Plus. 

To enable this factor graph formulation, we define a local contribution as follows:

$C^e_{\theta_e}(a_e) \equiv \Pr(\theta_e) \, u^e(\theta_e, a_e)$. (3.2)

Using this notation, the solution of the CGBG is

$\beta^* = \arg\max_{\beta} \sum_{e \in \mathcal{E}} \sum_{\theta_e} C^e_{\theta_e}(\beta_e(\theta_e))$. (3.3)

Thus, the solution corresponds to the maximum of an additively decomposed function containing a contribution
for each local joint type $\theta_e$. This can be expressed in a factor graph with one set of nodes for all
the contributions and another for all the individual types of all the agents. An individual type $\theta_i$ of an agent $i$
is connected to a contribution $C^e_{\theta_e}$ only if $i$ participates in $u^e$ and $\theta_e = \langle \theta_j \rangle_{j \in \mathbb{A}(u^e)}$ specifies $\theta_i$ for agent $i$, as
illustrated in Fig. 6. We refer to this graph as the agent and type independence (ATI) factor graph.⁶ Contributions
are separated, not only by the joint type to which they apply, but also by the local payoff function to
which they contribute. Consequently, both agent and type independence are naturally expressed. In the next
two subsections, we discuss the application of NDP and Max-Plus to this factor graph formulation in order
to efficiently solve CGBGs.
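
The construction just described can be sketched in code. This is an illustrative data-structure sketch under stated assumptions, not the authors' implementation: the function name `build_ati_factor_graph`, the dictionary layout, and the placeholder instance (three agents, two types each, two edges, uniform local probabilities) are all hypothetical.

```python
from itertools import product

def build_ati_factor_graph(edges, types, actions, local_prob, local_payoff):
    """Return (variables, factors) of an ATI-style factor graph.

    variables: list of (agent, type) pairs; each one selects an action.
    factors:   dict mapping (edge, local joint type) to (neighbours, table),
               where table[a_e] = Pr(theta_e) * u^e(theta_e, a_e).
    """
    agents = sorted({i for scope in edges.values() for i in scope})
    variables = [(i, t) for i in agents for t in types]
    factors = {}
    for e, scope in edges.items():
        for theta_e in product(types, repeat=len(scope)):
            # A contribution touches exactly the types named in theta_e.
            neighbours = tuple(zip(scope, theta_e))
            table = {
                a_e: local_prob[e][theta_e] * local_payoff(e, theta_e, a_e)
                for a_e in product(actions, repeat=len(scope))
            }
            factors[(e, theta_e)] = (neighbours, table)
    return variables, factors

# Hypothetical instance: three agents, two types each, two local payoffs.
edges = {"u1": (1, 2), "u2": (2, 3)}
types = ["t1", "t2"]
actions = ["a1", "a2"]
local_prob = {e: {th: 0.25 for th in product(types, repeat=2)} for e in edges}

def local_payoff(e, theta_e, a_e):
    # Placeholder payoff; any function of (e, theta_e, a_e) could be used.
    return 1.0 if a_e[0] == a_e[1] else 0.0

variables, factors = build_ati_factor_graph(
    edges, types, actions, local_prob, local_payoff)
degree = {v: sum(v in nbrs for nbrs, _ in factors.values()) for v in variables}
```

For this instance the graph has six type variables and eight contribution factors. A type of agent 1 touches two contributions, while a type of agent 2, which participates in both edges, touches four.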



6 In previous work, we referred to this as the 'type-action' factor graph, since its variables correspond to actions selected for individual
types (Oliehoek, 2010).

Figure 6: Factor graph for GENERALIZED FIRE FIGHTING with three agents, two types per agent, and two local
payoff functions. Agents 1 and 2 participate in payoff function $u^1$ (corresponding to the first four contributions),
while agents 2 and 3 participate in $u^2$ (corresponding to the last four contributions). The factor graph expresses
both agent independence (e.g., agent 1 does not participate in $u^2$) and type independence (e.g., the action agent
1 selects when it receives observation $\theta_1^1$ affects only the first 2 contributions).



3.4.1 Non-Serial Dynamic Programming for CGBGs 



Non-serial dynamic programming (NDP) (Bertele and Brioschi, 1972) can be used to find the maximum configuration
of a factor graph. In the forward pass, the variables in the factor graph are eliminated one by one
according to some prespecified order. Eliminating the $k$th variable $v$ involves collecting all the factors in which
it participates and replacing them with a new factor $f^k$ that represents the sum of the removed factors, given that
$v$ selects a best response. Once all variables are eliminated, the backward pass begins, iterating through the
variables in reverse order of elimination. Each variable selects a best response to the variables already visited,
eventually yielding an optimal joint policy.
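
The forward and backward passes can be sketched as generic max-sum variable elimination. This is a minimal illustration rather than the implementation used in the experiments: the function name `ndp`, the table-based factor encoding, and the three-variable chain used as input are assumptions made for the example.

```python
from itertools import product

def ndp(domains, factors, order):
    """Maximize a sum of factors by variable elimination.

    domains: dict mapping each variable to its list of values.
    factors: list of (scope, table); table maps value tuples to payoffs.
    order:   elimination order covering every variable exactly once.
    """
    factors = list(factors)
    responses = []   # (variable, scope of new factor, best-response table)
    for v in order:                                   # forward pass
        involved = [f for f in factors if v in f[0]]
        factors = [f for f in factors if v not in f[0]]
        scope = tuple(sorted({u for s, _ in involved for u in s if u != v}))
        new_table, response = {}, {}
        for values in product(*(domains[u] for u in scope)):
            context = dict(zip(scope, values))

            def total(x):
                # Sum of the removed factors when v takes value x.
                context[v] = x
                return sum(t[tuple(context[u] for u in s)]
                           for s, t in involved)

            best = max(domains[v], key=total)
            new_table[values], response[values] = total(best), best
        factors.append((scope, new_table))
        responses.append((v, scope, response))
    value = sum(t[()] for _, t in factors)            # only constants remain
    assignment = {}
    for v, scope, response in reversed(responses):    # backward pass
        assignment[v] = response[tuple(assignment[u] for u in scope)]
    return value, assignment

# Hypothetical chain x1 - x2 - x3 with two pairwise factors.
domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}
factors = [
    (("x1", "x2"), {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 2}),
    (("x2", "x3"), {(0, 0): 3, (0, 1): 0, (1, 0): 0, (1, 1): 1}),
]
value, assignment = ndp(domains, factors, order=["x1", "x2", "x3"])
```

Each new factor is a table over the remaining neighbours of the eliminated variable, so its size is exponential in the number of those neighbours. This is where the dependence on the induced width comes from.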

The maximum number of agents participating in a factor encountered during NDP is known as the induced
width $w$ of the ordering. The following result is well-known (see for instance (Dechter, 1999)):



Theorem 3.1. NDP requires exponential time and space in the induced width w. 

Even though NDP is still exponential, for sparse problems the induced width is much smaller than the total
number of variables $V$, i.e., $w \ll V$, leading to an exponential speedup over naive enumeration of joint
variables.

In previous work (Oliehoek et al., 2008c), we used NDP to optimally solve CGBGs. However, NDP was
applied to the agent independence (AI) factor graph (e.g., as in Fig. 2b) that results from converting the CGBG
to a CGSG. Consequently, only agent independence was exploited. In principle, we should be able to improve
performance by applying NDP to the ATI factor graph introduced above, thereby exploiting both agent and type
independence. Fig. 7 illustrates a few steps of the resulting algorithm.

However, there are two important limitations of the NDP approach. First, the computational complexity
is exponential in the induced width, which in turn depends on the order in which the variables are eliminated.
Determining the optimal order (which Bertele and Brioschi (1973) call the secondary optimization problem) is
NP-complete (Arnborg et al., 1987). While there are heuristics for determining the order, NDP scales poorly
in practice on densely connected graphs (Kok and Vlassis, 2006). Second, because of the particular shape that
type independence induces on the factor graph, we can establish the following:

Figure 7: A few steps of NDP run on the factor graph of Fig. 6. Variables are eliminated from left to right.
Dotted ellipses indicate the part of the factor graph to be eliminated.

Theorem 3.2. The induced width of an ATI factor graph is exponential in the number of individual types:
$w \propto O(\exp(|\Theta_*|))$, where $\Theta_*$ denotes the largest individual type set.

Proof. Let us consider the first elimination step of NDP for an arbitrary individual type $\theta_i$ (i.e., an arbitrary variable). Now,
each edge $e \in \mathcal{E}$ in which it participates induces $O(|\Theta_*|^{|e|-1})$ contributions to which it is connected: one contribution
for each profile $\theta_{e \setminus i}$ of types of the other agents in $e$. As a result, the new factor $f^1$ is connected to all
types of the neighbors in the interaction hyper-graph. The number of such types is at least $|\Theta_*|$. □

Note that $|\Theta_*|$ is not the only term that determines $w$; the number of edges $e \in \mathcal{E}$ in which agents participate
as well as the elimination order still matter. In particular, let $k$ denote the maximum degree of a contribution
factor, i.e., the largest local scope $k = \max_{e \in \mathcal{E}} |\mathbb{A}(u^e)|$. Clearly, since there is a factor that has degree $k$, we have
that $w \geq k$.

Corollary 1. The computational complexity of NDP applied to an ATI factor graph is exponential in the number 
of individual types. 

Proof. This follows directly from Theorems 3.1 and 3.2. □

Therefore, even given the ATI factor graph formulation, it seems unlikely that NDP will prove useful in 
exploiting type independence. In particular, we hypothesize that NDP applied to the ATI factor graph will not 
perform significantly better than NDP on the AI factor graph. In fact, it is possible that the former performs 
worse than the latter. This is illustrated in the last elimination step shown in Fig. 7. A factor $f^3$ is introduced
with degree $w = 3$ and size $|\mathcal{A}_*|^w$. In contrast, performing NDP on the AI factor graph using the 'same' left-to-right
ordering has induced width $w = 1$ and the size of the factors constructed is $(|\mathcal{A}_*|^2)^w$ (where $2 = |\Theta_*|$).



3.4.2 Max-Plus for CGBGs



In order to more effectively exploit type independence, we consider a second approach in which the factor
graph is solved using the Max-Plus message passing algorithm (Pearl, 1988; Wainwright et al., 2004;
Kok and Vlassis, 2005; Vlassis, 2007). Max-Plus was originally proposed by Pearl (1988) under the name
belief revision to compute the maximum a posteriori probability configurations in Bayesian networks. The
algorithm is also known as max-product or min-sum (Wainwright et al., 2004) and is a special case of the
sum-product algorithm (Kschischang et al., 2001), also referred to as belief propagation in probabilistic domains.
Max-Plus can be implemented in either a centralized or decentralized way (Kok and Vlassis, 2006).
However, since we assume planning takes place in a centralized off-line phase, we consider only the former.

Max-Plus is an appealing choice for several reasons. First, on structured problems it has
been shown to achieve excellent performance in practice (Kschischang et al., 2001; Kok and Vlassis, 2006;
Kuyer et al., 2008). Second, unlike NDP, it is an anytime algorithm that can provide results after each iteration
of the algorithm, not only at the end (Kok and Vlassis, 2006). Third, as we show below, its computational
complexity is exponential only in the size of the largest local payoff function's scope, which is fixed for a given
CGBG.

At an intuitive level, Max-Plus works by iteratively sending messages between the factors, corresponding
to contributions, and variables, corresponding to (choices of actions for) types. These messages encode how
much payoff the sender expects to be able to contribute to the total payoff. In particular, a message sent
from a type $i$ to a contribution $j$ encodes, for each possible action, the payoff it expects to contribute. This is
computed as the sum of the incoming messages from other contributions $k \neq j$. Similarly, a message sent from
a contribution to a type $i$ encodes the payoff it can contribute conditioned on each available action to the agent
with type $i$.⁷

Max-Plus iteratively passes these messages over the edges of the factor graph. Within each iteration, the
messages are sent either in parallel or sequentially with a fixed or random ordering. When run on an acyclic
factor graph (i.e., a tree), it is guaranteed to converge to an optimal fixed point (Pearl, 1988; Wainwright et al.,
2004). In cyclic factor graphs, such as those defined in Sec. 3.4, there are no guarantees that Max-Plus will
converge.⁸ However, experimental results have demonstrated that it works well in practice even when cycles
are present (Kschischang et al., 2001; Kok and Vlassis, 2006; Kuyer et al., 2008). This requires normalizing
the messages to prevent them from growing ever larger, e.g., by taking a weighted sum of the new and old
messages (damping).
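
The message updates described above can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: the function name `max_plus`, the fixed sequential schedule, the mean-subtraction normalization, and the chain-shaped example are assumptions made here.

```python
def max_plus(domains, factors, iterations=10, damping=0.0):
    """Approximately maximize a sum of factors by message passing.

    domains: dict mapping each variable to its list of values.
    factors: list of (scope, table); table maps value tuples to payoffs.
    damping: weight given to the old factor-to-variable message.
    """
    pairs = [(j, v) for j, (scope, _) in enumerate(factors) for v in scope]
    var_to_fac = {(v, j): {x: 0.0 for x in domains[v]} for j, v in pairs}
    fac_to_var = {(j, v): {x: 0.0 for x in domains[v]} for j, v in pairs}
    for _ in range(iterations):
        # Factor -> variable: maximize over the other variables in scope.
        for j, (scope, table) in enumerate(factors):
            for v in scope:
                new = {x: float("-inf") for x in domains[v]}
                for values, payoff in table.items():
                    local = dict(zip(scope, values))
                    score = payoff + sum(var_to_fac[(u, j)][local[u]]
                                         for u in scope if u != v)
                    new[local[v]] = max(new[local[v]], score)
                old = fac_to_var[(j, v)]
                fac_to_var[(j, v)] = {
                    x: damping * old[x] + (1 - damping) * new[x] for x in new}
        # Variable -> factor: sum of messages from the other factors.
        for (v, j) in list(var_to_fac):
            msg = {x: sum(fac_to_var[(k, v)][x]
                          for k, (s, _) in enumerate(factors)
                          if v in s and k != j)
                   for x in domains[v]}
            mean = sum(msg.values()) / len(msg)
            var_to_fac[(v, j)] = {x: m - mean for x, m in msg.items()}
    assignment = {}
    for v in domains:
        belief = {x: sum(fac_to_var[(k, v)][x]
                         for k, (s, _) in enumerate(factors) if v in s)
                  for x in domains[v]}
        assignment[v] = max(belief, key=belief.get)
    return assignment

# Hypothetical tree-structured example (chain x1 - x2 - x3).
domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}
factors = [
    (("x1", "x2"), {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 2}),
    (("x2", "x3"), {(0, 0): 3, (0, 1): 0, (1, 0): 0, (1, 1): 1}),
]
assignment = max_plus(domains, factors)
```

On this acyclic example the messages settle after two sweeps and the selected assignment is the maximizer. On cyclic graphs the same loop can simply be stopped after a fixed number of iterations, which is what makes the method usable as an anytime algorithm.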

As mentioned above, the computational complexity of Max-Plus on a CGBG is exponential only in the 
size of the largest local payoff function's scope. More precisely, we show here that this claim holds for one 
iteration of Max-Plus. In general, it is not possible to bound the number of iterations, since Max-Plus is not 
guaranteed to converge. However, by applying renormalization and/or damping, Max-Plus converges quickly 
in practice. Also, since Max-Plus is an anytime algorithm, it is possible to limit the number of iterations to a 
constant number. 

Theorem 3.3. One iteration of Max-Plus run on the factor graph constructed for a CGBG is tractable for 
small local neighborhoods, i.e., the only exponential dependence is in the size of the largest local scope. 



7 For a detailed description of how the messages are computed, see Oliehoek (2010, Sec. 5.5.3).
8 However, recent variants of the message passing approach have slight modifications that yield convergence guarantees
(Globerson and Jaakkola, 2008). Since we found that regular Max-Plus performs well in our experimental setting, we do not
consider such variants in this article.

Proof. Lemma 8.1 in the appendix characterizes the complexity of one iteration of Max-Plus as

$O(m^k \cdot k^2 \cdot l \cdot F)$, (3.4)

where, for a factor graph for a CGBG, the interpretation of the symbols is as follows:

• $m$ is the maximum number of values a type variable can take. It is given by $m = |\mathcal{A}_*|$, the size of the
largest action set.

• $k$ is the maximum degree of a contribution (factor), given by the largest local scope $k = \max_{e \in \mathcal{E}} |\mathbb{A}(u^e)|$.

• $l$ is the maximum degree of a type (variable). Again, each local payoff function $e \in \mathcal{E}$ in which it
participates induces $O(|\Theta_*|^{k-1})$ contributions to which it is connected. Let $\rho_*$ denote the maximum
number of edges in which an agent participates. Then $l = O(\rho_* \cdot |\Theta_*|^{k-1})$.

• $F = O(\rho \cdot |\Theta_*|^k)$ is the number of contributions, one for each local joint type.

By substituting these numbers and reordering terms we get that one iteration of Max-Plus for a CGBG has
cost:

$O(|\mathcal{A}_*|^k \cdot k^2 \cdot \rho \, \rho_* \, |\Theta_*|^{2k-1})$. (3.5)

Thus, in the worst case, one iteration of Max-Plus scales polynomially with respect to the number of local
payoff functions $\rho$ and the largest sets of actions $|\mathcal{A}_*|$ and types $|\Theta_*|$. It scales exponentially in $k$. □
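
As a worked check, the substitution that leads from (3.4) to (3.5) can be written out term by term. The notation here is an assumption about the intended symbols: $\rho$ is read as the number of local payoff functions, $\rho_*$ as the maximum number of edges in which an agent participates, and $\mathcal{A}_*$ and $\Theta_*$ as the largest individual action and type sets.

```latex
\begin{align*}
O\!\left(m^{k} \cdot k^{2} \cdot l \cdot F\right)
  &= O\!\left(|\mathcal{A}_*|^{k} \cdot k^{2}
       \cdot \rho_* \, |\Theta_*|^{k-1}
       \cdot \rho \, |\Theta_*|^{k}\right) \\
  &= O\!\left(|\mathcal{A}_*|^{k} \cdot k^{2}
       \cdot \rho \, \rho_* \, |\Theta_*|^{2k-1}\right).
\end{align*}
```

For instance, with $k = 2$ and $|\mathcal{A}_*| = |\Theta_*| = 3$, the terms that depend on $k$ evaluate to $|\mathcal{A}_*|^k = 9$, $k^2 = 4$, and $|\Theta_*|^{2k-1} = 27$, while $\rho$ and $\rho_*$ enter only as linear factors.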

Given this result, we expect that Max-Plus will prove more effective than NDP at exploiting type inde- 
pendence. In particular, we hypothesize that Max-Plus will perform better when applied to the ATI factor 
graph instead of the AI factor graph and that it will perform better than NDP applied to the ATI factor graph. In 
the following section, we present experiments evaluating these hypotheses. 



3.5 Random CGBG Experiments 

To assess the relative performance of NDP and Max-Plus, we conduct a set of empirical evaluations on
randomly generated CGBGs. We use randomly generated games because they allow for testing on a range of
different problem parameters. In particular, we are interested in the effect of scaling the number of agents $n$,
the number of types $|\Theta_i|$ that each agent has, the number of actions for each agent $|\mathcal{A}_i|$, as well as the number
of agents involved in each payoff function, $|\mathbb{A}(e)|$. We assume each payoff function has an equal number of
agents and refer to this property as $k = \max_{e \in \mathcal{E}} |\mathbb{A}(e)|$, as in Theorem 3.3.

Furthermore, we empirically evaluate the influence of exploiting both agent and type independence versus 
exploiting only one or the other. We do so by running both NDP and Max-Plus on the agent-independence
(AI) factor graph (Fig. 2b), the type-independence (TI) factor graph (Fig. 3), and the agent and type independence
(ATI) factor graph (Fig. 6).

These experiments serve three main purposes. First, they empirically validate Theorems 3.2 and 3.3, confirming
the difference in computational complexity between NDP and Max-Plus. Second, they quantify the
magnitude of the difference in runtime performance between these two methods. Third, they shed light on the
quality of the solutions found by Max-Plus, which is guaranteed to be optimal only on tree-structured graphs.



3.5.1 Experimental Setup 

For simplicity, when generating CGBGs, we assume that 1) the scopes of the local payoff functions have the
same size $k$, 2) the individual action sets have the same size $|\mathcal{A}_i|$, and 3) the individual type sets have the same
size $|\Theta_i|$. For each set of parameters, we generate 1,000 CGBGs on which to test.

Each game is generated following a procedure similar to that used by Kok and Vlassis (2006) for generating
CGSGs.⁹ We start with a set of $n$ agents with no local payoff functions defined, i.e., they form an interaction
hypergraph with no edges. As long as the interaction hypergraph is not yet connected, i.e., there does not exist
a path between every pair of agents, we add a local payoff function involving $k$ agents.

9 The main difference is in the termination condition: we stop adding edges when the interaction graph is fully connected instead of
adding a pre-defined number of edges.

(a) $n = 5$, $k = 2$. (b) $k = 3$; left: $n = 5$, right: $n = 8$. (c) $n = 8$, $k = 2$.

Figure 8: Example interaction hypergraphs of randomly generated CGBGs for different parameter settings.

As a result, the number of edges in different CGBGs generated for the same parameter setting may differ
significantly. The $k$ agents that participate in a new edge are selected uniformly at random from the subset
of agents involved in the fewest number of edges. Payoffs $u^e(\theta_e, a_e)$ are drawn from a normal distribution
$\mathcal{N}(0,1)$, and the local joint type probabilities $\Pr(\theta_e)$ are drawn from a uniform distribution and then normalized.
This algorithm results in fully connected interaction hypergraphs that are balanced in terms of the number of
payoff functions in which each agent participates. Fig. 8 shows some examples of the interaction hypergraphs
generated.
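
The hypergraph part of this generation procedure can be sketched as below. The sketch is one possible reading of the description, not the generator used for the reported experiments: the function names are hypothetical, agents are ranked by how many edges they already participate in with random tie-breaking, and duplicate edges are not filtered out.

```python
import random

def connected(n, edges):
    """True if every pair of agents is linked by a path of hyper-edges."""
    reached, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for e in edges:
            if i in e:
                for j in e:
                    if j not in reached:
                        reached.add(j)
                        frontier.append(j)
    return len(reached) == n

def random_interaction_hypergraph(n, k, rng):
    """Add k-agent edges until the hypergraph is connected (requires k <= n)."""
    edges, degree = [], [0] * n
    while not connected(n, edges):
        # Prefer agents that are in the fewest edges; break ties at random.
        ranked = sorted(range(n), key=lambda i: (degree[i], rng.random()))
        edge = tuple(sorted(ranked[:k]))
        edges.append(edge)
        for i in edge:
            degree[i] += 1
    return edges

edges = random_interaction_hypergraph(n=5, k=2, rng=random.Random(0))
```

Because each new edge always takes the agents with the lowest current participation, the number of edges per agent never differs by more than one, which mirrors the balance property mentioned above. Payoffs and local type probabilities would then be sampled per edge.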

We test the following methods: 



NDP Non-serial dynamic programming (see Sec. 3.4.1), run on the agent-independence factor graph (NDP-AI),
the type-independence factor graph (NDP-TI), and the proposed agent-type independence factor
graph (NDP-ATI).

MP Max-Plus with the following parameters: 10 restarts, 25 maximum iterations, sequential random message
passing scheme, damping factor 0.2. Analogous to NDP, there are three variations: Max-Plus-AI,
Max-Plus-TI, and Max-Plus-ATI.

BaGaBaB Bayesian game branch and bound (BaGaBaB) is a fast method for optimally solving CBGs
(Oliehoek et al., 2010). It performs heuristic search over partially specified policies.¹⁰

AltMax Alternating Maximization with 10 restarts (see Sec. 2.3.3).

CE Cross Entropy optimization (de Boer et al., 2005) is a randomized optimization method that maintains a
distribution over joint policies. It works by iterating the following steps: 1) sampling a set of joint
policies from the distribution, 2) using the best fraction of samples to update the distribution. We used
two parameter settings: one that gave good results according to (Oliehoek et al., 2008a) (CENORMAL),
and one that is faster (CEFAST).¹¹



All methods were implemented using the MADP Toolbox (Spaan and Oliehoek, 2008); the NDP and Max-Plus
implementations also use LIBDAI (Mooij, 2008b). Experiments in this section are run on an Intel Core i5 CPU
(2.67GHz) using Linux, and the timing results are CPU times. Each process is limited to 1GB of memory use
and the computation time for solving a single CGBG is limited to 5s.

10 In the experiments we used the 'MaxContributionDifference' joint type ordering and the 'consistent complete information' heuristic.

11 Both variants perform 10 restarts, use a learning rate of 0.2 and perform what Oliehoek et al. (2008a) refer to as 'approximate evaluation'
of joint policies. CENORMAL performs 300 and CEFAST 100 simulations per joint policy. CENORMAL performs $I = 50$ iterations,
in each of which $N = 100$ joint policies are sampled of which $N_b = 5$ policies are used to update the maintained distribution. CEFAST uses
$I = 15$, $N = 40$, $N_b = 2$.

FG type   num. factors (F)          fact. size                          fact. deg. (k)   num. vars        var. size (m)                     var. deg. (l)
AI        $\rho$                    $|\mathcal{A}_*|^{k|\Theta_*|}$     $k$              $n$              $|\mathcal{A}_*|^{|\Theta_*|}$    $\rho_*$
TI        $|\Theta_*|^n$            $|\mathcal{A}_*|^n$                 $n$              $n|\Theta_*|$    $|\mathcal{A}_*|$                 $|\Theta_*|^{n-1}$
ATI       $\rho|\Theta_*|^k$        $|\mathcal{A}_*|^k$                 $k$              $n|\Theta_*|$    $|\mathcal{A}_*|$                 $\rho_*|\Theta_*|^{k-1}$

Table 4: A characterization of the different types of factor graphs: agent-independence (AI), type-independence
(TI), agent-type independence (ATI). The symbols relate to (3.4).

For each method, we report both the average payoff and the average CPU time needed to compute the
solution. In the plots in this section, each data point represents an average over $N_g = 1{,}000$ games. The
reported payoffs are normalized with respect to those of Max-Plus-ATI. As such, the payoff of Max-Plus-ATI
is always 1. Error bars indicate the standard deviation of the sampled mean $\sigma_{mean} = \sigma/\sqrt{N_g}$.
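
The normalization and the error bars can be illustrated with a short sketch. It is not the code used for the experiments: the function name `normalized_mean_and_sem` is hypothetical, and we assume that payoffs are normalized by the mean payoff of the reference method and that $\sigma$ is the sample standard deviation.

```python
import math
import statistics

def normalized_mean_and_sem(payoffs, reference_payoffs):
    """Return (normalized mean payoff, standard deviation of the sampled mean).

    payoffs:           per-game payoffs of the method under evaluation
    reference_payoffs: per-game payoffs of the reference method
                       (Max-Plus-ATI in the text)
    """
    n_games = len(payoffs)
    ref_mean = statistics.fmean(reference_payoffs)
    # Assumption: normalize by the reference mean, so the reference scores 1.
    normalized = [p / ref_mean for p in payoffs]
    mean = statistics.fmean(normalized)
    # sigma_mean = sigma / sqrt(N_g), with sigma the sample standard deviation.
    sem = statistics.stdev(normalized) / math.sqrt(n_games)
    return mean, sem

mean, sem = normalized_mean_and_sem([8.0, 10.0, 12.0], [10.0, 10.0, 10.0])
print(mean, sem)
```

With $N_g = 1{,}000$ games the error bar is roughly 3% of the per-game standard deviation, which is why small payoff differences can still be resolved.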

3.5.2 Comparing Methods 

First, we compare NDP-ATI and Max-Plus-ATI with other methods that do not exploit a factor graph
representation explicitly; the results are shown in Fig. 9. These results demonstrate that, as the number
of agents increases, the average payoff of the approximate non-graphical methods goes down (Fig. 9a) while
computation time goes up (Fig. 9b), given that the other parameters are fixed at $k = 2$, $|\Theta_i| = 3$,
$|\mathcal{A}_i| = 3$. Note that a data point is not presented if the method exceeded the pre-defined resource
limits on one or more test runs. For instance, BaGaBaB can compute solutions only up to 4 agents. Also, Fig. 9b
suggests that NDP-ATI would on average complete within the 5s deadline for 6 agents. However, because there is
at least one run that does not, the data point is not included.

Next, we fix the number of agents to 5, and vary the number of actions $|\mathcal{A}_i|$. While CENORMAL never
meets the time limit, the computation time that CEFAST requires is relatively independent of the number of
actions (Fig. 9d). Payoff, however, drops sharply when the number of actions increases (Fig. 9c). The CE
solvers maintain a fixed-size pool of possible solutions, which explains both phenomena: the same number of
samples need to cover a larger search space, but the cost of evaluating each sample is relatively insensitive to
$|\mathcal{A}_i|$.

Finally, we consider the behavior of the different methods when increasing the number of individual types
$|\Theta_i|$ (Fig. 9e and 9f). Since the number of policies for an agent is exponential in the number of types, the
existing methods scale poorly. As established by Corollary [?], NDP's computational costs also grow exponentially
with the number of types. Max-Plus, in contrast, scales much better in the number of types. Looking at the quality
of the found policies, we see that Max-Plus achieves the optimal value in these experiments, while the other
approximate methods achieve lower values.

3.5.3 Comparing Factor-Graph Formulations 

We now turn to a more in-depth analysis of the NDP and Max-Plus methods, in order to establish the effect of 
exploiting different types of independence. In particular, we test both methods on three different types of fac- 
tor graphs: those with agent-independence (AI), type-independence (TI), and agent-type independence (ATI). 
Table 4 summarizes the types of factor graphs and the symbols used to describe their various characteristics.

First, we consider scaling the number of agents, using the same parameters as in Fig. 9a and 9b. Fig. 10a
shows the payoff of the different methods. The difference between them is not significant. However, Fig. 10b
shows the computation times of the same methods. As expected, the methods that use only type independence
scale poorly because the number of factors in their factor graph is exponential in the number of agents (Table 4).

Figure 9: Comparison of Max-Plus-ATI and NDP-ATI with other methods, scaling the number of agents ((a)
and (b)), the number of actions ((c) and (d)), and the number of types ((e) and (f)).

Fig. 10c and 10d show similar comparisons for payoff functions involving three agents, i.e., $k = 3$ (example
interaction hypergraphs are shown in Fig. 8b). The difference in payoff between NDP-ATI and Max-Plus-ATI
is not significant (minimum p-value is 0.61907 for 6 agents). Differences with AM and the outlying points of
Max-Plus-AI are significant (p-value < 0.05). The NDP-AI and NDP-ATI methods scale to 6 agents, while
Max-Plus-AI and Max-Plus-ATI scale beyond. The payoff of Max-Plus-AI is worse and more erratic
than the payoff of Max-Plus-ATI. In this case, due to increased problem complexity, the Max-Plus methods
typically do not attain the true optimum.

These experiments clearly demonstrate that only Max-Plus-AI and Max-Plus-ATI scale to larger num- 
bers of agents. The poor scalability of the non-factor-graph methods is due to their failure to exploit the 
independence in CGBGs. The methods using TI factor graphs scale poorly because they ignore independence 
between agents. As hypothesized, NDP is not able to effectively exploit type independence and consequently 
NDP-ATI does not outperform NDP-AI. In fact, the experiments show that, in some cases, NDP-AI slightly 
outperforms NDP-ATI. 

Fig. 10e and 10f show the performance of Max-Plus-AI and Max-Plus-ATI for games with $k = 2$, $|\Theta_i| = 4$,
$|\mathcal{A}_i| = 4$ and larger numbers of agents, from $n = 10$ up to $n = 725$ (limited by the allocated memory space).
For this experiment, the methods were allowed 30s per CGBG instance. Fig. 10e shows the absolute payoff
obtained by both methods and the growth in the number of payoff functions (edges). The results demonstrate
that the Max-Plus-ATI payoffs do not deteriorate when increasing the number of payoff functions. Instead,
they increase steadily at a rate similar to the number of payoff functions. This is as expected, since more
payoff functions means there is more reward to be collected. Max-Plus-AI scales only to 50 agents, and
its payoffs are close to those obtained by Max-Plus-ATI. Fig. 10f provides clear experimental corroboration
of Theorem 3.3 (which states that there is no exponential dependence on the number of agents), by showing
scalability to 725 agents.

Analogously to Sec. 3.5.2, we compare the different factor-graph methods when increasing the number of
actions and types. Fig. 11b shows that Max-Plus-ATI scales better in the number of actions while obtaining
payoffs close to optimal (when available) and better than other Max-Plus variations (Fig. 11a; differences are
not significant). In fact, it is the only method whose computation time increases only slightly when increasing
the number of actions: the size of each factor is only $|\mathcal{A}_*|^k$ compared to $|\mathcal{A}_*|^{k|\Theta_*|}$ for AI and
$|\mathcal{A}_*|^n$ for TI (Table 4). In this case $k = 2$ and $n = 5$, and in general $k \ll n$ in the domains we consider.
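
To make the difference in factor sizes concrete, the following sketch evaluates the three expressions for $k = 2$ and $n = 5$. The helper `factor_sizes` is hypothetical, and the value $|\Theta_*| = 3$ is an assumption taken from the settings of Sec. 3.5.2.

```python
def factor_sizes(num_actions, num_types, k, n):
    """Number of entries in a single factor for each factor-graph type."""
    return {
        "AI": num_actions ** (k * num_types),   # |A*|^(k |Theta*|)
        "TI": num_actions ** n,                 # |A*|^n
        "ATI": num_actions ** k,                # |A*|^k
    }

# Assumed setting: k = 2, n = 5, |Theta*| = 3, with a growing number of actions.
for num_actions in (2, 3, 4, 5):
    print(num_actions, factor_sizes(num_actions, num_types=3, k=2, n=5))
```

The ATI factor grows quadratically in the number of actions here, whereas the AI and TI factors grow with the sixth and fifth power respectively.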

When scaling the number of types (Fig. 11c and 11d), again there are no significant differences in payoffs.
However, as expected given the lack of exponential dependence on $|\Theta_i|$ (Table 4), Max-Plus-ATI performs
much better in terms of computation times.

Overall, the results presented in this section demonstrate that the proposed methods substantially outperform
existing solution methods for CBGs. In addition, the experiments confirm the hypothesis that NDP is not able
to effectively exploit type independence, resulting in exponential scaling with respect to the number of types.
Max-Plus on ATI factor graphs is able to effectively exploit both agent and type independence, resulting in
much better scaling behavior with respect to all model parameters. Finally, the experiments showed that the
value of the found solutions was not significantly lower than the optimal value and, in many cases, was
significantly better than that found by other approximate solution methods.

3.6 Generalized Fire Fighting Experiments 

The results presented above demonstrate that Max-Plus-ATI can improve performance on a wide range of
CGBGs. However, all of the CGBGs used in those experiments were randomly generated. In this section, we
aim to demonstrate that the advantages of Max-Plus-ATI extend to a more realistic problem. To this end,
we apply it to a 2-dimensional implementation of the Generalized Fire Fighting problem described in
Sec. 3.2. Each method was limited to 2GB of memory and allowed 30s computation time.

Figure 10: Comparison of the proposed factor-graph methods when scaling the number of agents $n$. Plots (a)
and (b) consider $k = 2$ (analogous to Fig. 9a and 9b), while (c) and (d) show results for hyper-edges with $k = 3$.
Plots (e) and (f) display the scaling behavior for many agents.

Figure 11: Comparison of the proposed factor-graph methods when scaling the number of actions and the
number of types. Plots (a) and (b) consider scaling $|\mathcal{A}_i|$ (analogous to Fig. 9c and 9d), while (c) and (d) show
results increasing $|\Theta_i|$ (cf. Fig. 9e and 9f).

In this implementation, the houses are uniformly spread across a 2-dimensional plane, i.e., the x and y
coordinates for each house are drawn from a uniform distribution over the interval [0,1]. Similarly, each of the
$n$ agents is assigned a random location and can choose to fight fire at any of the $N_a$ nearest houses, subject to the
constraint that at most $k$ agents can fight fire at a particular house. We enforce this constraint by making a house
unavailable to additional agents once it is in the action sets of $k$ agents.^12 In addition, each agent is assigned, in
a similar fashion, the $N_o$ nearest houses that it can observe. As mentioned in Sec. 3.2, a type $\theta_i$ is defined by
the $N_o$ observations the agent receives from the surrounding houses: $\theta_i \in \{F, N\}^{N_o}$. We assume that $N_o < N_a$
to ensure that no local payoff function depends on an agent simply due to that agent's type, i.e., an agent never
observes a house at which it cannot fight fire.
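
The assignment of action sets can be sketched as a greedy procedure. This is only an illustration under stated assumptions: the function name `assign_action_sets` is hypothetical, agents are processed in a fixed order, and ties in distance are broken by house index; the actual implementation may differ in these details.

```python
import math
import random

def assign_action_sets(agents, houses, n_a, k):
    """Give each agent up to n_a nearest houses, with at most k agents per house.

    agents, houses: lists of (x, y) positions in the unit square.
    Returns a list of house-index lists, one per agent.
    """
    load = [0] * len(houses)      # number of action sets that contain each house
    action_sets = []
    for position in agents:       # assumption: agents are handled in list order
        by_distance = sorted(
            range(len(houses)),
            key=lambda h: math.dist(position, houses[h]),
        )
        chosen = []
        for h in by_distance:
            if load[h] < k:       # house is still available
                chosen.append(h)
                load[h] += 1
            if len(chosen) == n_a:
                break
        action_sets.append(chosen)
    return action_sets

rng = random.Random(0)
houses = [(rng.random(), rng.random()) for _ in range(8)]
agents = [(rng.random(), rng.random()) for _ in range(3)]
print(assign_action_sets(agents, houses, n_a=2, k=2))
```

Because the assignment is greedy, agents processed later may receive houses that are not their nearest ones, which is the source of the sub-optimality acknowledged in the text.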

To ensure there are always enough houses, the number of houses is made proportional to both the number
of agents and actions: $N_H = \mathrm{ceil}(N_d \cdot N_a \cdot n)$, where $N_d$ is set to 1.2 unless noted otherwise. Each house has a
fire level that is drawn uniformly from $\{0, \ldots, N_f - 1\}$. The probability that an agent receives the observation
F for a house H depends on its fire level, as shown in Table 5. Observations of different houses by a single
agent $i$ are assumed to be independent, but observations of different agents that can observe the same house are
coupled through the hidden state.

$x_H$     $\Pr(F | x_H)$
0         0.2
1         0.5
> 1       0.8

Table 5: The observation probabilities of a house H.

12 While this can lead to a sub-optimal assignment, we do not need the best assignment of action sets in order to compare methods.

The local reward induced by each house depends on the fire level and the number of agents that chose to
fight fire at that house. It is specified by

$$R(x_H, n_{present}) = -x_H \cdot 0.7^{n_{present}}.$$

As in Sec. 2.3.1, this reward can be transformed to a (in this case local) utility function by taking the expectation
with respect to the hidden state:

$$u_H(\theta_H, a_H) = \sum_{x_H = 1}^{N_f - 1} \Pr(x_H | \theta_H) \, R(x_H, \mathrm{CountAgentsAtHouse}(H, a_H)),$$

where CountAgentsAtHouse() counts the number of agents for which $a_H$ specifies to fight fire at house H.
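
As a worked illustration of the local utility, the sketch below computes the expected reward of one house under a posterior over its fire level. The function names and the posterior values are hypothetical; only the reward $-x_H \cdot 0.7^{n_{present}}$ is taken from the text.

```python
def local_reward(fire_level, n_present):
    """R(x_H, n_present) = -x_H * 0.7 ** n_present."""
    return -fire_level * 0.7 ** n_present

def local_utility(fire_level_posterior, joint_action, house):
    """Expected local reward of `house` given a posterior over its fire level.

    fire_level_posterior: dict {fire level x_H: Pr(x_H | theta_H)}
    joint_action:         tuple with the house chosen by each agent in scope
    """
    # Plays the role of CountAgentsAtHouse: agents whose action targets `house`.
    n_present = sum(1 for target in joint_action if target == house)
    return sum(
        prob * local_reward(x_h, n_present)
        for x_h, prob in fire_level_posterior.items()
    )

posterior = {0: 0.5, 1: 0.3, 2: 0.2}   # hypothetical Pr(x_H | theta_H)
print(local_utility(posterior, joint_action=("H1", "H1"), house="H1"))
print(local_utility(posterior, joint_action=("H2", "H2"), house="H1"))
```

The term for $x_H = 0$ contributes nothing, which is why the sum in the utility can start at $x_H = 1$.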

This formulation of Generalized Fire Fighting, while still abstract, captures the essential coordination
challenges inherent in many realistic problems. For instance, this formulation may directly map to the problem
of fire fighting in the Robocup Rescue Simulation league (Kitano et al., 1999): fire fighting agents are
distributed in a city and must decide at which houses to fight fire. While limited communication may be possible,
it is infeasible for each agent to broadcast all its observations to all the other agents. If instead it is feasible to
compute a joint BG-policy based on the agents' positions, then they can effectively coordinate without
broadcasting their observations. When interpreting the houses as queues, it also directly corresponds to problems in
queueing networks (Cogill et al., 2004).



The results of our experiments in this domain, shown in Fig. 12, demonstrate that Max-Plus has the most
desirable scaling behavior in a variety of different parameters. In particular, Fig. 12a and 12b show that all
approximate methods scale well with respect to the number of actions per agent, but Max-Plus performs
best. Fig. 12c shows that this scalability does not come at the expense of solution quality. For all settings,
all the methods computed solutions with the same value (other plots of value are thus omitted for brevity).
The advantage of exploiting agent independence is illustrated in Fig. 12d, which demonstrates that Max-Plus
scales well with the number of agents, in contrast to the other methods. In Fig. 12e we varied the $N_d$ parameter,
which determines how many houses are present in the domain. It shows that Max-Plus is sensitive to $k$, the
maximum number of agents that participate in a house. However, it also demonstrates that when the interaction
is sparse (i.e., when there are many houses per agent and therefore on average fewer than $k$ agents per house) the
increase in runtime is much better than the worst-case exponential growth.^13 Fig. 12f shows how runtime scales
with $N_o$, the number of houses observed by each agent. Since the number of types is exponential in $N_o$,
Max-Plus's runtime is also exponential in $N_o$. Nonetheless, it substantially outperforms the other approximate
methods.

Figure 12: Results for the Generalized Fire Fighting problem.

13 The dense setting $N_d = 0.5$ does not have data points at $k = 1$ because in this case there are not enough houses
to be assigned to the agents.



4 Exploiting Independence in Dec-POMDPs 

While CBGs model an important class of collaborative decision-making problems, they apply only to one-shot
settings, i.e., where each agent needs to select only one action. However, the methods for exploiting agent and
type independence that we proposed in Sec. 3 can also provide substantial leverage in sequential tasks, in which
agents take a series of actions over time as the state of the environment evolves. In particular, many sequential
collaborative decision-making tasks can be formalized as decentralized partially observable Markov decision
processes (Dec-POMDPs) (Bernstein et al., 2002). In this section we demonstrate how CGBGs can be used in
a planning method for the subclass of factored Dec-POMDPs with additively factored rewards (Oliehoek et al.,
2008c). The resulting method, called Factored FSPC, can find approximate solutions for classes of problems
that cannot be addressed at all by any other planning methods.

Our aim is not to present Factored FSPC as a main contribution of this article. Instead, we use it
as a vehicle for illustrating the utility of the CGBG framework. Therefore, for the sake of conciseness, we do
not describe this method in full technical detail; we only sketch the solution approach and supply
references to other work containing more detail.



4.1 Factored Dec-POMDPs 

In a Dec-POMDP, multiple agents must collaborate to maximize the sum of the common rewards they receive 
over multiple timesteps. Their actions affect not only their immediate rewards but also the state to which they 
transition. While the current state is not known to the agents, at each timestep each agent receives a private 
observation correlated with that state. In a factored Dec-POMDP, the state consists of a vector of state variables 
and the reward function is the sum of a set of local reward functions. 

Definition 4.1. A factored Dec-POMDP is a tuple $\langle \mathcal{D}, \mathcal{S}, \mathcal{A}, T, \mathcal{R}, \mathcal{O}, O, b^0, h \rangle$, where

• $\mathcal{D} = \{1, \ldots, n\}$ is the set of agents.

• $\mathcal{S} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_{|\mathcal{X}|}$ is the factored state space. That is, $\mathcal{S}$ is spanned by $\mathcal{X} = \{\mathcal{X}_1, \ldots, \mathcal{X}_{|\mathcal{X}|}\}$, a set of state
variables, or factors. A state corresponds to an assignment of values for all factors $s = (x_1, \ldots, x_{|\mathcal{X}|})$.

• $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions, where $\mathcal{A}_i$ is the set of actions available to agent $i$.

• $T$ is a transition function specifying the state transition probabilities $\Pr(s' | s, a)$.

• $\mathcal{R} = \{R^1, \ldots, R^\rho\}$ is the set of $\rho$ local reward functions. Again, these correspond to an interaction graph
with (hyper) edges $\mathcal{E}$ such that the total immediate reward $R(s, a) = \sum_{e \in \mathcal{E}} R^e(x_e, a_e)$.

• $\mathcal{O} = \times_i \mathcal{O}_i$ is the set of joint observations $o = (o_1, \ldots, o_n)$.

• $O$ is the observation function, which specifies observation probabilities $\Pr(o | a, s')$.

• $b^0$ is the initial state distribution at time $t = 0$.

• $h$ is the horizon, i.e., the number of stages. We consider the case where $h$ is finite.

At each stage $t = 0, \ldots, h - 1$, each agent takes an individual action and receives an individual observation.
Their goal is to maximize the expected cumulative reward or return. The planning task entails finding a joint
policy $\pi = (\pi_1, \ldots, \pi_n)$ that specifies an individual policy $\pi_i$ for each agent $i$. Such an individual policy in
general specifies an individual action for each action-observation history $\vec{\theta}_i^t = (a_i^0, o_i^1, \ldots, a_i^{t-1}, o_i^t)$, e.g.,
$\pi_i(\vec{\theta}_i^t) = a_i^t$. However, when only allowing deterministic or pure policies, $\pi_i$ maps each observation history
$(o_i^1, \ldots, o_i^t) = \vec{o}_i^t \in \vec{\mathcal{O}}_i^t$ to an action, e.g., $\pi_i(\vec{o}_i^t) = a_i^t$. In a factored Dec-POMDP, the transition and
observation model can be compactly represented in a dynamic Bayesian network (DBN) (Boutilier et al., 1999).
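
The additive reward structure of Definition 4.1 can be illustrated as follows. This is a minimal sketch with hypothetical names (`total_reward`, `penalty`) and a made-up two-component example; each local reward function only receives the state factors and actions in its own scope.

```python
def total_reward(state, joint_action, local_rewards):
    """R(s, a) = sum over e of R^e(x_e, a_e).

    state:         dict {factor name: value}
    joint_action:  dict {agent index: action}
    local_rewards: list of (state_scope, agent_scope, function) triples
    """
    total = 0.0
    for state_scope, agent_scope, reward_fn in local_rewards:
        x_e = tuple(state[f] for f in state_scope)          # restrict s to the scope
        a_e = tuple(joint_action[i] for i in agent_scope)   # restrict a to the scope
        total += reward_fn(x_e, a_e)
    return total

# Hypothetical local reward: penalize the factor values in scope unless some
# agent in scope chooses "fight".
def penalty(x_e, a_e):
    return 0.0 if "fight" in a_e else -float(sum(x_e))

local_rewards = [(("X1",), (1,), penalty), (("X1", "X2"), (1, 2), penalty)]
state = {"X1": 2, "X2": 1}
print(total_reward(state, {1: "wait", 2: "wait"}, local_rewards))
print(total_reward(state, {1: "wait", 2: "fight"}, local_rewards))
```

Storing the scopes explicitly is what allows a solver to touch only $k$ agents per component rather than the full joint action.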



4.1.1 Sequential Fire Fighting 

As a running example we consider the Sequential Fire Fighting problem, which is a sequential variation
of Generalized Fire Fighting from Sec. 3.2, originally introduced by Oliehoek et al. (2008c).^14 In this
version, each house, instead of simply being on fire or not, has an integer fire level that can change over time.
Thus, the state of the environment is factored, using one state variable for the fire level in each house. Each
agent receives an observation about the house at which it fought fire in the last stage. It observes flames (F) at
this house with probability 0.2 if $x_H = 0$, with probability 0.5 if $x_H = 1$, and with probability 0.8 otherwise.

At each stage, each agent $i$ chooses at which of its assigned houses to fight fire. These actions
probabilistically affect the state to which the environment transitions: the fire level of each house H is
influenced by its previous value, by the actions of the agents that can go to H, and by the fire level of the
neighboring houses. The transitions in turn determine what reward is generated. Specifically, each house
generates a negative reward equal to its expected fire level at the next stage $x'_H$. Thus, the reward function can be
described as the sum of local reward functions, one for each house. We consider the case of $N_H = 4$ as shown in
Fig. 4. For house $H = 1$ the reward is specified by

$$R^1(x_{\{1,2\}}, a_1) = \sum_{x'_1} -x'_1 \Pr(x'_1 | x_{\{1,2\}}, a_1), \tag{4.1}$$

where $x_{\{1,2\}}$ denotes $(x_1, x_2)$. This formulation is possible because $x_1$, $x_2$ and $a_1$ are the only variables that
influence the probability of $x'_1$. Similarly, the other local reward functions are given by $R^2(x_{\{1,2,3\}}, a_{\{1,2\}})$,
$R^3(x_{\{2,3,4\}}, a_{\{2,3\}})$ and $R^4(x_{\{3,4\}}, a_3)$. For more details about the formulation of Sequential Fire Fighting
as a factored Dec-POMDP, see (Oliehoek, 2010).
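
A local reward of the form (4.1) is simply the negative expected fire level at the next stage. The sketch below evaluates it for house 1; the transition probabilities in `transition_house1` are invented for illustration and are not the ones used in Sequential Fire Fighting.

```python
def local_reward_house(next_level_dist):
    """Sum over x' of -x' * Pr(x' | x_scope, a_scope), as in (4.1)."""
    return sum(-level * prob for level, prob in next_level_dist.items())

def transition_house1(x1, x2, a1):
    """Hypothetical Pr(x'_1 | x_1, x_2, a_1), for illustration only."""
    if a1 == "fight_house_1":
        lower = max(x1 - 1, 0)
        return {lower: 0.8, x1: 0.2} if lower != x1 else {x1: 1.0}
    # Unattended house: a burning neighbor makes an increase more likely.
    p_up = 0.6 if x2 > 0 else 0.2
    return {x1 + 1: p_up, x1: 1.0 - p_up}

print(local_reward_house(transition_house1(x1=1, x2=1, a1="fight_house_1")))
print(local_reward_house(transition_house1(x1=1, x2=1, a1="fight_house_2")))
```

Only $x_1$, $x_2$ and $a_1$ appear as arguments, mirroring the scope of $R^1$.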



4.1.2 Solving Dec-POMDPs 

Solving a Dec-POMDP entails finding an optimal joint policy. Unfortunately, optimally solving Dec-POMDPs
is NEXP-complete (Bernstein et al., 2002), as is finding an $\epsilon$-approximate solution (Rabinovich et al., 2003).
Given these difficulties, most research efforts have focused on special cases that are more tractable. In
particular, assumptions of transition and observation independence (TOI) (Becker et al., 2004) have been
investigated to exploit independence between agents, e.g., as in ND-POMDPs (Nair et al., 2005; Varakantham et al.,
2007). However, given the TOI assumption, many interesting tasks, such as two robots carrying a chair, cannot
be modeled. Recently, Witwicki and Durfee (2010) proposed transition-decoupled Dec-POMDPs, in which
there is limited interaction between the agents. While this approach speeds up both optimal and approximate
solutions of this subclass, scalability remains limited (the approach has not been tested with more than two
agents) and the subclass is still quite restrictive (e.g., it does not admit the chair carrying scenario).

Other work has considered approximate methods for the general class of Dec-POMDPs based on representing
a Dec-POMDP using CBGs (Emery-Montemerlo et al., 2004) or on approximate dynamic programming
(Seuken and Zilberstein, 2007). In the remainder of this section, we show that CGBGs can help the former category
of methods achieve unprecedented scalability with respect to the number of agents.

14 In (Oliehoek et al., 2008c), this problem is referred to as FireFightingGraph. We use a different name here to
distinguish it from Generalized Fire Fighting, which is also graphical.



4.2 Factored Dec-POMDPs as Series of CGBGs 

A stage $t$ of a Dec-POMDP can be represented as a CBG, given a past joint policy $\varphi^t$. Such a
$\varphi^t = (\delta^0, \ldots, \delta^{t-1})$ specifies the joint decision rules for the first $t$ stages. An individual decision
rule $\delta_i^t$ of agent $i$ corresponds to the part of its policy that specifies actions for stage $t$. That is, $\delta_i^t$ maps
from observation histories $\vec{o}_i^t$ to actions $a_i$. Given a $\varphi^t$, the corresponding CBG is constructed as follows:

• The action sets are the same as in the Dec-POMDP,

• Each agent $i$'s action-observation history $\vec{\theta}_i^t$ corresponds to its type: $\theta_i \equiv \vec{\theta}_i^t$,

• The probability of joint types is specified by $\Pr(\vec{\theta}^t | b^0, \varphi^t) = \sum_{s^t} \Pr(s^t, \vec{\theta}^t | b^0, \varphi^t)$,

• The payoff function $u(\theta, a) \equiv Q(\vec{\theta}^t, a)$, the expected payoff for the remaining stages.

Similarly, a factored Dec-POMDP can be represented by a series of CGBGs. This is possible because, in
general, the Q-function is factored. In other words, it is the sum of a set of local Q-functions of the following
form: $Q^e(x_e^t, \vec{\theta}_e^t, a_e)$. Each local Q-function depends on a subset of state factors (the state factor scope) and the
action-observation histories and actions of a subset of agents (the agent scope) (Oliehoek, 2010).

Constructing a CGBG for a stage $t$ is similar to constructing a CBG. Given a past joint policy $\varphi^t$, we can
construct the local payoff functions for the CGBG:

$$u^e(\theta_e, a_e) \equiv Q^e(\vec{\theta}_e^t, a_e) = \sum_{x_e^t} \Pr(x_e^t | \vec{\theta}_e^t, b^0, \varphi^t) \, Q^e(x_e^t, \vec{\theta}_e^t, a_e). \tag{4.2}$$

As such, the structure of the CGBG is induced by the structure of the Q-value function.

As an example, consider 3-agent $h = 2$ Sequential Fire Fighting. The last stage $t = 1$ can be represented
as a CGBG given a past joint policy $\varphi^1$. Also, since it is the last stage, the factored immediate reward
function (e.g., as in Equation 4.1) represents all the expected future reward. That is, it coincides with an optimal
factored Q-value function (Oliehoek et al., 2008c) and can be written as follows:

$$Q^e(\vec{\theta}_e^1, a_e) = \sum_{x_e^1} \Pr(x_e^1 | \vec{\theta}_e^1, b^0, \varphi^1) \, R^e(x_e^1, a_e). \tag{4.3}$$

This situation can be represented using a CGBG, by using the Q-value function as the payoff function:
$u^e(\theta_e, a_e) \equiv Q^e(\vec{\theta}_e^1, a_e)$, as shown in Fig. 13. The figure shows the 4 components of the CGBG, each one
corresponding to the payoff associated with one house. It also indicates an arbitrary BG policy for agent 2.
Note that, since components 1 and 4 of the Q-value function have scopes that are 'subscopes' of components
2 and 3 respectively, the former can be absorbed into the latter, reducing the number of components without
increasing the size of those that remain.



The following theorem by Oliehoek et al. (2008c) shows that modeling a factored Dec-POMDP in this way
is in principle exact.

Theorem 4.1. Modeling a factored Dec-POMDP with additive rewards using a series of CGBGs is exact: it 
yields the optimal solution when using an optimal Q-value function. 

While an optimal Q-value function is factored, the last stage contains the most independence: when moving
back in time towards $t = 0$, the scope of dependence grows, due to the transition and observation functions.
Fig. 14 illustrates this process in Sequential Fire Fighting. Thus, even though the value function is
factored, the scopes of its components may at earlier stages include all state factors and agents.

Figure 13: A CGBG for $t = 1$ of Sequential Fire Fighting. Given a past joint policy $\varphi^1$, each joint type
$\theta$ corresponds to a joint action-observation history $\vec{\theta}^1$. The entries give the Q-values $Q^e(\vec{\theta}_e^1, a_e)$.
Highlighted is an arbitrary policy for agent 2.



4.3 Factored Forward Sweep Policy Computation 

Defining the payoff function for each CBG representing a stage of a Dec-POMDP requires computing $Q(\vec{\theta}^t, a)$,
an optimal payoff function. Unfortunately, doing so is intractable. Therefore, Dec-POMDP methods based on
CBGs typically use approximate Q-value functions instead. One option ($Q_{BG}$) is to assume that each agent
always has access to the joint observations for all previous stages, but can access only its individual observation
for the current stage. Another choice is based on the underlying POMDP ($Q_{POMDP}$), i.e., a POMDP with the
same transition and observation function in which a single agent receives the joint observations and takes joint
actions. A third option is based on the underlying MDP ($Q_{MDP}$), in which this single agent can directly observe
the state.

Given an approximate Q-function, an approximate solution for the Dec-POMDP can be computed via
forward-sweep policy computation (FSPC) by simply solving the CBGs for stages $0, 1, \ldots, h - 1$ consecutively.
The solution to each CBG is a joint decision rule specified by $\delta^t \equiv \beta^{t,*}$ that is used to augment the past policy
$\varphi^{t+1} = (\varphi^t, \delta^t)$. The CBGs are solved consecutively because both the probabilities and the payoffs at each stage
depend on the past policy. It is also possible to compute an optimal policy via backtracking, as in multiagent A*
(MAA*). However, since doing so is much more computationally intensive, we focus on FSPC in this article.
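
The control flow of FSPC can be sketched in a few lines. The functions `build_cbg` and `solve_cbg` are placeholders supplied by the caller (hypothetical names, not part of any toolbox API); the toy stand-ins below exist only so that the sketch runs.

```python
def forward_sweep(horizon, build_cbg, solve_cbg):
    """Forward-sweep policy computation over stages 0, ..., horizon - 1.

    build_cbg(t, past_policy) -> a CBG for stage t (its probabilities and
                                 payoffs both depend on the past joint policy)
    solve_cbg(cbg)            -> a joint decision rule for that stage
    """
    past_policy = ()                       # phi^0: no decisions made yet
    for t in range(horizon):
        cbg = build_cbg(t, past_policy)
        decision_rule = solve_cbg(cbg)     # delta^t
        # phi^{t+1} = (phi^t, delta^t); there is no backtracking.
        past_policy = past_policy + (decision_rule,)
    return past_policy                     # joint policy for all stages

# Toy stand-ins: a real CBG would hold types, a type distribution and payoffs.
def toy_build(t, past_policy):
    return {"stage": t, "history_length": len(past_policy)}

def toy_solve(cbg):
    return f"delta^{cbg['stage']}"

print(forward_sweep(3, toy_build, toy_solve))
```

Each stage is solved exactly once, so the total cost is $h$ times the cost of building and solving a single game.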

In the remainder of this section, we describe a method we call Factored FSPC for approximately solving
a broader class of factored Dec-POMDPs in a way that scales well in the number of agents. The main idea is
simply to replace the CBGs used in each stage of the Dec-POMDPs with CGBGs, which are then solved using
the methods presented in Sec. 3.

Since finding even bounded approximate solutions for Dec-POMDPs is NEXP-complete, any computationally
efficient method must necessarily make unbounded approximations. Below, we describe a set of such
approximations designed to make Factored FSPC a practical algorithm. While we present only an overview
here, complete details are available in (Oliehoek, 2010). Also, though the results we present in this article
evaluate only the complete resulting method, Oliehoek (2010) empirically evaluated each of the component
approximations separately.

Figure 14: The scope of $Q^1$, illustrated by shading, increases when going back in time (stages $h - 3$, $h - 2$,
$h - 1$) in Sequential Fire Fighting.



4.3.1 Approximate Inference 

One source of intractability in Factored FSPC lies in the marginalization required to compute the
probabilities $\Pr(\vec{\theta}_e^t | b^0, \varphi^t)$ and $\Pr(x_e^t | \vec{\theta}_e^t, b^0, \varphi^t)$. In particular, constructing each CGBG requires generating each
component $e$ separately. However, as (4.2) shows, in general this requires the probabilities $\Pr(x_e^t | \vec{\theta}_e^t, b^0, \varphi^t)$.
Moreover, in any efficient solution algorithm for Dec-POMDPs, the probabilities $\Pr(\vec{\theta}_e^t | b^0, \varphi^t)$ are necessary,
as illustrated in Sec. 3. Since maintaining and marginalizing over $\Pr(s, \vec{\theta}^t | b^0, \varphi^t)$ is intractable, we resort to
approximate inference, as is standard practice when computing probabilities over states with many factors. Such
methods perform well in many cases and the error they introduce can in some cases be theoretically bounded
(Boyen and Koller, 1998).



In our case, we use the factored frontier (FF) algorithm (Murphy and Weiss, 2001) to perform approximate
inference on a DBN constructed for the past joint policy $\varphi^t$ under concern. This DBN models stages $0, \ldots, t$ and
has both state factors and action-observation histories as its nodes. We use FF because it is simple and allows
computation of some useful intermediate representations when a heuristic of the form $Q^e(x_e^t, a_e)$ (e.g., factored
$Q_{MDP}$) is used. Other approximate inference algorithms (e.g., Murphy, 2002; Mooij, 2008a) could also be used.
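
The core idea behind this kind of approximation, representing the belief as a product of per-node marginals and propagating each node from the marginals of its parents, can be sketched as follows. This is a simplified illustration in the spirit of FF, not the implementation used in our experiments; the function `ff_update`, the node names and the conditional probabilities are all hypothetical.

```python
from itertools import product

def ff_update(marginals, parents, cpds):
    """One propagation step with a fully factored belief.

    marginals: dict {node: {value: prob}} for the current slice
    parents:   dict {next-slice node: tuple of current-slice parent nodes}
    cpds:      dict {next-slice node: function(parent_values) -> {value: prob}}
    The joint over a node's parents is approximated by the product of their
    marginals, which is where the approximation error enters.
    """
    new_marginals = {}
    for node, pa in parents.items():
        dist = {}
        for values in product(*(marginals[p].keys() for p in pa)):
            weight = 1.0
            for p, v in zip(pa, values):
                weight *= marginals[p][v]
            for value, prob in cpds[node](values).items():
                dist[value] = dist.get(value, 0.0) + weight * prob
        new_marginals[node] = dist
    return new_marginals

# Hypothetical binary example: house 1 burns next stage with probability 0.9
# if it burns now, 0.3 if only its neighbor burns, and 0 otherwise.
def cpd_x1(parent_values):
    x1, x2 = parent_values
    p_burn = 0.9 if x1 == 1 else (0.3 if x2 == 1 else 0.0)
    return {1: p_burn, 0: 1.0 - p_burn}

marginals = {"X1": {0: 0.5, 1: 0.5}, "X2": {0: 0.5, 1: 0.5}}
print(ff_update(marginals, {"X1'": ("X1", "X2")}, {"X1'": cpd_x1}))
```

The cost of one step is exponential only in the number of parents of a node, not in the total number of state factors.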



4.3.2 Approximate Q-Value Functions 

Computing the optimal value functions to use as payoffs for the CGBG for each stage is intractable. For small
Dec-POMDPs, heuristic payoff functions such as $Q_{MDP}$ and $Q_{POMDP}$ are typically used instead (Oliehoek et al.,
2008b). However, for Dec-POMDPs of the size we consider in this article, solving the underlying MDP or
POMDP is also intractable.

Furthermore, factored Dec-POMDPs pose an additional problem: the scopes of $Q^*$ increase when going
backwards in time, such that they are typically fully coupled for earlier stages (see Fig. 14). This problem is
exacerbated when $Q_{MDP}$ and $Q_{POMDP}$ are used as heuristic payoff functions because they become fully coupled
through just one backup (due to the maximization over joint actions that is conditioned on the state or belief).

Fortunately, many researchers have considered factored approximations for factored MDPs and factored
POMDPs (Schweitzer and Seidman, 1985; Koller and Parr, 1999, 2000; Schuurmans and Patrascu, 2002; Guestrin et al.,
2001a,b, 2003; de Farias and Van Roy, 2003; Szita and Lörincz, 2008). We follow a similar approach for
Dec-POMDPs by using value functions with predetermined approximate scopes. The idea is that in many cases
the influence of a state factor quickly vanishes with the number of links in the DBN. For instance, in the case
of transition and observation independence (TOI), the optimal scopes equal those of the factored immediate
reward function. In many cases where there is no complete TOI, the amount of interaction is still limited,
making it possible to determine a reduced set of scopes for each stage that affords a good approximate solution.
For example, consider the optimal scopes shown in Fig. 14. Though $x_4$ at $h - 3$ can influence $x_1$ at $h - 1$, the
magnitude of this influence is likely to be small. Therefore, restricting the scope of $Q^1$ to exclude $x_4$ at $h - 3$ is
a reasonable approximation.

Following the literature on factored MDPs, we use manually specified scope structures (this is equivalent 
to specifying basis functions). In the experiments presented in this article, we simply use the immediate reward 
scopes at each stage, though many alternative strategies are possible. While developing methods for finding 
such scope structures automatically is an important goal, it is beyond the scope of this article. A heuristic 
approach suffices to validate the utility of the CGBG framework because our methods require only a good 
approximate factored value function whose scopes preserve some independence. 

To compute a heuristic given a specified scope structure, we use an approach we call transfer planning.
Transfer planning is motivated by the observation that, for a factored Dec-POMDP, the value function is 'more
factored' than for a factored MDP. In the former, dependence propagates over time, while the latter becomes
fully coupled through just one backup. Therefore, it may be preferable to directly approximate the factored
Q-value function of the Dec-POMDP rather than the $Q_{MDP}$ function. To do so, we use the solution of smaller
source problems that involve fewer agents. That is, transfer planning directly tries to find heuristic values
$Q^{e,t}(\vec{\theta}_e^t, a_e) \equiv Q^s(\vec{\theta}^t, a)$ by solving tasks that are similar (but smaller) and using their value functions $Q^s$. The
$Q^s$ can result from the solutions of the smaller Dec-POMDPs, or of their underlying MDP or POMDP. In order
to map the values $Q^s$ of the source tasks to the CGBG components $Q^{e,t}$, we specify a mapping from agents
participating in a component $e$ to agents in the source problem.

Since no formal claims can be made about these approximate Q-values, we cannot guarantee that they
constitute an admissible heuristic. However, since we rely on FSPC, which does not backtrack, an admissible
heuristic is not necessarily better. Performance depends on the accuracy, not the admissibility, of the heuristic
(Oliehoek et al., 2008b). The experiments we present below demonstrate that these approximate Q-values are
accurate enough to enable high quality solutions.



4.4 Experiments 



We evaluate FACTORED FSPC on two problem domains: SEQUENTIAL FIRE FIGHTING and the ALOHA problem
(Oliehoek, 2010). The latter consists of a number of islands, each equipped with a radio used to transmit
messages to its local population. Each island has a queue of messages that it needs to send and at each time
step can decide whether or not to send a message. When two neighboring islands attempt to send a message
in the same timestep, a collision occurs. Each island can noisily observe whether a successful transmission (by
itself or the neighbors), no transmission, or a collision occurred. At each timestep, each island receives a reward
of -1 for each message in its queue. ALOHA is considerably more complex than SEQUENTIAL FIRE FIGHTING.
First, it has 3 observations per agent, which means that the number of observation histories grows much
faster. Also, the transition model of ALOHA is more densely connected than that of SEQUENTIAL FIRE FIGHTING:
the reward component for each island is affected by the island itself and all its neighbors. As a result, in all the
ALOHA problems we consider, there is at least one immediate reward function whose scope contains 3 agents,
i.e., k = 3. Fig. 15 illustrates the case with four islands in a square configuration. The experiments below also
consider variants in which islands are connected in a line.
Figure 15: The Aloha problem with four islands arranged in a square.




Figure 16: The ATI factor graph for immediate reward scopes in Aloha. The connections of u 2 are shown in 
detail by the black dashed lines; connections for other factors are summarized abstractly by the blue solid lines. 

In all cases, we use immediate reward scopes that have been reduced (i.e., scopes that form a proper sub-scope
of another scope are removed) before computing the factored Q-value functions. This means that for all
stages, the factored Q-value function has the same factorization and thus the ATI factor graphs have identical
shapes (although the number of types differs). For SEQUENTIAL FIRE FIGHTING, the ATI factor graph for
a stage t is the same as that for GENERALIZED FIRE FIGHTING (see the corresponding figure), except that the
types $\theta_i$ now correspond to action-observation histories $\vec\theta^{\,t}_i$. The ATI factor graph for a stage t of the
ALOHA problem is shown in Fig. 16.
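The scope reduction described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the function name `reduce_scopes` and the four-island line example are assumptions, the latter based only on the statement that each island's reward depends on the island itself and its neighbors.

```python
from typing import FrozenSet, Iterable, List


def reduce_scopes(scopes: Iterable[Iterable[int]]) -> List[FrozenSet[int]]:
    """Remove every scope that is a proper subset of another scope."""
    unique = {frozenset(s) for s in scopes}  # also drops exact duplicates
    kept = [s for s in unique if not any(s < other for other in unique)]
    # Sort by member list so the result is deterministic.
    return sorted(kept, key=lambda s: sorted(s))


# Assumed example: four islands in a line, indexed 0..3. Each reward scope
# holds an island and its neighbors.
line_scopes = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
reduced = reduce_scopes(line_scopes)
```

In this assumed example the two end-island scopes are absorbed by the scopes of their neighbors, leaving two scopes of three agents each.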

To compute Q_TP, the transfer-planning heuristic, for the SEQUENTIAL FIRE FIGHTING problem, we use
2-agent SEQUENTIAL FIRE FIGHTING as the source problem for all the edges and map the lower agent index
in a scope to agent 1 and the higher index to agent 2. For the ALOHA problem, we use the 3-island in-line
variant as the source problem and perform a similar mapping, i.e., the lowest agent index in a scope is mapped
to agent 1, the middle to agent 2 and the highest to agent 3. For both problems we use the Q_MDP and Q_BG
heuristics for the source problems.
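The index-based mapping from component agents to source agents can be sketched as follows. This is a hedged illustration only: `agent_mapping`, `transfer_q` and the callable `source_q` are hypothetical names, and the toy source value function stands in for a Q_MDP or Q_BG heuristic computed on the source problem.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

# Hypothetical signature: (local histories, local actions), both ordered by
# source-agent index 1..k, mapped to a heuristic Q-value of the source problem.
SourceQ = Callable[[Tuple[Hashable, ...], Tuple[int, ...]], float]


def agent_mapping(scope: Sequence[int]) -> Dict[int, int]:
    """Map the lowest agent index in a scope to source agent 1, the next to 2, etc."""
    return {agent: rank for rank, agent in enumerate(sorted(scope), start=1)}


def transfer_q(scope: Sequence[int],
               histories: Dict[int, Hashable],
               actions: Dict[int, int],
               source_q: SourceQ) -> float:
    """Look up the value of a component by re-indexing its agents into the source problem."""
    ordered = sorted(scope)  # source agent 1 first, then 2, ...
    return source_q(tuple(histories[i] for i in ordered),
                    tuple(actions[i] for i in ordered))


# Toy source value function (an assumption, for illustration only).
toy_source_q = lambda hist, act: float(act[0] - act[1])
value = transfer_q((7, 3), {3: (), 7: ()}, {3: 1, 7: 0}, toy_source_q)
```

Here agents 3 and 7 of the target problem play the roles of source agents 1 and 2, so the same source value function can be reused for every component with two agents.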

For problems small enough to solve optimally, we compare the solution quality of FACTORED FSPC to that
of GMAA*-ICE, the state-of-the-art method for optimally solving Dec-POMDPs (Spaan et al., 2011). We also
compare against several other approximate methods for solving Dec-POMDPs, including non-factored FSPC
and direct cross-entropy policy search (DICE) (Oliehoek et al., 2008a), one of the few methods demonstrated
to work on Dec-POMDPs with more than three agents that are not transition and observation independent. For
non-factored FSPC, we use alternating maximization with 10 restarts to solve the CBGs. For DICE we again
use the two parameter settings described in Sec. 3.5 (DICE-normal and DICE-fast).

Figure 17: FACTORED FSPC (ff) solution quality compared to optimal and the baselines.

As baselines, we include a random joint policy and the best joint policy in which each agent selects the same 
fixed action for all possible histories (though the agents can select different actions from each other). Naturally, 
these simple policies are suboptimal. However, in the case of the fixed-action baseline, simplicity is a virtue. 
The reason stems from a fundamental dilemma Dec-POMDP agents face about how much to exploit their private 
observations. Doing so helps them accrue more local reward but makes their behavior less predictable to other 
agents, complicating coordination. Because it does not exploit private observations at all, the disadvantages 
of the fixed-action policy are partially compensated by the advantages of predictability, yielding a surprisingly 
strong baseline. 
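The fixed-action baseline can be sketched as an exhaustive search over joint actions. This is a minimal sketch under stated assumptions: `evaluate` is a hypothetical callable returning the (estimated) value of the joint policy in which agent i always plays the i-th entry, and the toy evaluator below is not one of the benchmark problems.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def num_fixed_action_policies(n_agents: int, n_actions: int) -> int:
    """Number of joint policies in which each agent repeats one fixed action."""
    return n_actions ** n_agents


def best_fixed_action_policy(n_agents: int,
                             actions: Sequence[int],
                             evaluate: Callable[[Tuple[int, ...]], float]):
    """Enumerate all joint fixed actions and return the best one with its value."""
    best = max(product(actions, repeat=n_agents), key=evaluate)
    return best, evaluate(best)


# Toy evaluator (an assumption): penalize agent i for deviating from action i % 2.
toy_evaluate = lambda joint: -float(sum(abs(a - i % 2) for i, a in enumerate(joint)))
best_joint, best_value = best_fixed_action_policy(3, (0, 1), toy_evaluate)
```

Because the search space grows as the number of actions raised to the number of agents, even this simple baseline becomes costly as agents are added.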

We also considered including a baseline in which the agents are allowed to select different actions at each
timestep (but are still constrained to a fixed action for all histories of a given length). However, computing the
best fixed policy of this form proved intractable (see footnote 15). Note that it is not possible to use the solution
to the underlying factored MDP as a baseline, for two reasons. First, computing such solutions is not feasible for
problems of the size we consider in this article. Second, such solutions would not constitute meaningful baselines
for comparison. On the contrary, since such solutions cannot be executed in a decentralized fashion without
communication, they provide only a loose upper bound on the performance possible with a Dec-POMDP, as
quantitatively demonstrated by Oliehoek et al.



The experimental setup is as presented in Sec. 3.5.1, but with a memory limit of 2 GB and a maximum
computation time of 1 hour. The reported statistics are means over 10 restarts of each method. Once joint
policies have been computed, we perform 10,000 simulation runs to estimate their true values.
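Estimating the value of a computed joint policy by simulation can be sketched as a plain Monte Carlo average. This is an illustrative sketch, not the evaluation code used in the experiments: `simulate_episode` is a hypothetical callable that runs one episode under the joint policy with the given random generator and returns its cumulative reward.

```python
import random
import statistics
from typing import Callable, Tuple


def estimate_value(simulate_episode: Callable[[random.Random], float],
                   n_runs: int = 10_000,
                   seed: int = 0) -> Tuple[float, float]:
    """Return the mean return over n_runs episodes and its standard error."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    returns = [simulate_episode(rng) for _ in range(n_runs)]
    mean = statistics.fmean(returns)
    stderr = statistics.stdev(returns) / n_runs ** 0.5
    return mean, stderr


# Toy episode (an assumption): a deterministic return of -1.
mean, stderr = estimate_value(lambda rng: -1.0, n_runs=100)
```

The standard error gives a rough indication of how many runs are needed before differences between methods become meaningful.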



15 In fact, the complexity of doing so is $O(|A|^{nh})$, i.e., exponential in both the number of agents and the horizon. This is consistent with
the complexity result for the non-observable problem (NP-complete) (Pynadath and Tambe, 2002). By searching for an open-loop plan, we
effectively treat the problem as non-observable.
Figure 18: A comparison of FACTORED FSPC (ff) with different heuristics and other methods on the
SEQUENTIAL FIRE FIGHTING problem.



Fig. 17 compares FACTORED FSPC's solutions to optimal solutions on both problems. Fig. 17a shows
the results for SEQUENTIAL FIRE FIGHTING with two (red) and three agents (green). Optimal solutions were
computed up to horizon 6 in the former and horizon 4 in the latter problem. FACTORED FSPC with the Q_BG TP
heuristic achieves the optimal value for all these instances. When using the Q_MDP TP heuristic, results are near
optimal. For three agents, the optimal value is available only up to h = 4. Nonetheless, the curve of FACTORED
FSPC's values has the same shape as that of the optimal values for two agents, which suggests these points
are near optimal as well. While the fixed-action baseline performs relatively well for shorter horizons, it is
worse than random for longer horizons because there is always a chance that the non-selected house will keep
burning forever.

Fig. 17b shows results for ALOHA. The Q_BG TP heuristic is omitted since it performed the same as using
Q_MDP. For all settings at which we could compute the optimal value, FACTORED FSPC matches this value.
Since the ALOHA problem is more complex, FACTORED FSPC has difficulty computing solutions for higher
horizons. In addition, the fixed-action baseline performs surprisingly well, performing optimally for 3 islands
and near optimally for 4 islands. As with SEQUENTIAL FIRE FIGHTING, we expect that it would perform
worse for longer horizons: if one agent sends messages for several steps in a row, its neighbor is more likely
to have messages backed up in its queue. However, we cannot test this assumption since there are no existing
methods capable of solving ALOHA to such horizons against which to compare.

Fig. 18 compares FACTORED FSPC to other approximate methods on the SEQUENTIAL FIRE FIGHTING
domain with h = 5. For all numbers of agents, FACTORED FSPC finds solutions as good as or better than
those of non-factored FSPC, DICE-normal, DICE-fast, and the fixed-action and random baselines. In addition,
its running time scales much better than that of non-factored FSPC and the fixed-action baseline. Hence, this
result highlights the complexity of the problem, as even a simple baseline scales poorly. FACTORED FSPC
also runs substantially more quickly than DICE-normal and slightly more quickly than DICE-fast, both of which
run out of memory when there are more than five agents.

Fig. 19 presents a similar comparison for the ALOHA problem with h = 3. DICE-fast is omitted from these
plots because DICE-normal outperformed it. Fig. 19a shows that the value achieved by FACTORED FSPC
matches or nearly matches that of all the other methods on all island configurations. Fig. 19b shows the runtime
results for the inline configurations (see footnote 16). While the runtime of FACTORED FSPC is consistently
better than that of DICE-normal, non-factored FSPC and the fixed-action baseline are faster for small numbers
of agents. Non-factored FSPC is faster for three agents because the problem is fully coupled: there are 3 local
payoff functions involving 2, 3, and 2 agents, so k = 3. Thus FACTORED FSPC incurs the overhead of dealing with
multiple factors and constructing the FG but achieves no speedup in return. However, the runtime of FACTORED
FSPC scales much better as the number of agents increases.

Figure 19: A comparison of FACTORED FSPC with different heuristics and other methods on the ALOHA
problem with h = 3.

Overall, these results demonstrate that FACTORED FSPC is a substantial improvement over existing approximate
Dec-POMDP methods in terms of scaling with respect to the number of agents. However, the ability
of FACTORED FSPC to scale with respect to the horizon remains limited, since the number of types in the
CGBGs still grows exponentially with the horizon. In future work we hope to address this problem by clustering
types (Emery-Montemerlo et al., 2005; Oliehoek et al., 2009; Wu et al., 2011). In particular, by clustering
the individual types of an agent that induce similar payoff profiles (Emery-Montemerlo et al., 2005) or
probabilities over types of other agents (Oliehoek et al., 2009), it is possible to scale to much longer horizons. When
aggressively clustering to a constant number of types, runtime can be made linear in the horizon (Wu et al.,
2011). However, since such an improvement is orthogonal to the use of CGBGs, it is beyond the scope of the
current article. Moreover, we empirically evaluate the error introduced by a minimal number of approximations
required to achieve scalability with respect to the number of agents. Introducing further approximations would
confound these results.

Nonetheless, even in its existing form, FACTORED FSPC shows great promise due to its ability to exploit 
both agent and type independence in the CGBG stage games. To determine the limits of its scalability with 
respect to the number of agents, we conducted additional experiments applying FACTORED FSPC with the
Q_MDP TP heuristic to SEQUENTIAL FIRE FIGHTING with many more agents. The results, shown in Fig. 20,
do not include a fixed-action baseline because 1) performing simulations of all considered fixed-action joint
policies becomes expensive for many agents, and 2) the number of such joint policies grows exponentially with
the number of agents.



16 We omit the four agents in a square configuration in order to more clearly illustrate how runtime in the inline configurations scales
with respect to the number of agents.
Figure 20: FACTORED FSPC results on SEQUENTIAL FIRE FIGHTING with many agents.

As shown in Fig. 20a, FACTORED FSPC successfully computed solutions for up to 1000 agents for h = 2, 3
and 750 agents for h = 4. For h = 5, it computed solutions for up to 300 agents; even for h = 6 it computed
solutions for 100 agents, as shown in Fig. 20b. Note that, for the computed entries for h = 6, the expected value
is roughly equal to that for h = 5. This implies that the probability of any fire remaining at stage t = 5 is close to zero,
a pattern we also observed for the optimal solution in Fig. 17. As such, we expect that the solutions found for
these settings with many agents are in fact close to optimal. The runtime results, shown in Fig. 20c, increase
linearly with respect to the number of agents. While the runtime increases with the number of agents, the
bottleneck in our experiments preventing even further scalability was insufficient memory, not computation
time.

These results are a large improvement in the state of the art with respect to scalability in the number of
agents. Previous approaches for general Dec-POMDPs scaled only to 3 (Oliehoek et al., 2008c) or 5 agents
(Oliehoek et al., 2008a) (see footnote 17). Even when making much stricter assumptions such as transition and
observation independence, previous approaches have not scaled beyond 15 agents (Varakantham et al., 2007;
Marecki et al., 2008; Varakantham et al., 2009; Kumar and Zilberstein, 2009).



Though these experiments evaluate only the complete FACTORED FSPC method, in (Oliehoek, 2010) we
have empirically evaluated each of its component approximations separately. For the sake of brevity, we do not 
present those experiments in this article. However, the results confirm that each approximation is reasonable. In 
particular, they show that 1) approximate inference has no significant influence on performance, 2) Max-Plus 
solves CGBGs as well as or better than alternating maximization, 3) the use of 1-step back-projected scopes 
(i.e., scopes grown by one projection back in the DBN) can sometimes slightly outperform the use of immediate 
reward scopes, 4) there is not a large performance difference when using optimal scopes, and 5) depending on 
the heuristic, allowing backtracking can improve performance. 



5 Related Work 



The wide range of research related to the work presented in this article can be mined for various alternative
strategies for solving CGBGs. In this section, we briefly survey these alternatives, which fall into two
categories: 1) converting CGBGs to other types of games, and 2) converting them to constraint optimization problems.
We also discuss the relation to the framework of action-graph games (Jiang and Leyton-Brown, 2008).

The first strategy is to convert the CGBG to another type of game. In particular, CGBGs can be converted to 
CGSGs in the same way CBGs can be converted to SGs. These CGSGs can be modeled using interaction
hypergraphs or factor graphs such as those shown in Fig. 2 and solved by applying NDP or Max-Plus. However,
since this approach does not exploit type independence, the size of local payoff functions scales exponentially
in the number of types, making it impractical for large problems. In fact, this approach corresponds directly to
the methods, tested in Sec. 3.5, that exploit only agent independence (i.e., NDP-AI and MaxPlus-AI). The poor
performance of these methods in those experiments underscores the disadvantages of converting to CGSGs. 

CGBGs can also be converted to non-collaborative graphical SGs, for which a host of solution algorithms
have recently emerged (Vickrey and Koller, 2002; Ortiz and Kearns, 2003; Daskalakis and Papadimitriou, 2006).
However, to do so, the CGBG must first be converted to a CGSG, again forgoing the chance to exploit type
independence. Furthermore, in the resulting CGSG, all the payoff functions in which a given agent participates
must then be combined into an individual payoff function. This process, which corresponds to converting
from an edge-based decomposition to an agent-based one, results in the worst case in yet another exponential
increase in the size of the payoff function (Kok and Vlassis, 2006).



Another option is to convert the CGBG into a non-collaborative graphical BG (Singh et al., 2004) by
combining the local payoff functions into individual payoff functions directly at the level of the Bayesian game.
Again, this may lead to an exponential increase in the size of the payoff functions. Each BNE in the resulting
graphical BG corresponds to a local optimum of the original CGBG. Soni et al. (2007) recently proposed a
solution method for graphical BGs, which can then be applied to find locally optimal solutions. However, this
method converts the graphical BG into a CGSG and thus suffers an exponential blow-up in the size of the payoff
functions, just like the other conversion approaches.

17 However, in very recent work, Wu et al. (2010) present results for up to 20 agents on general Dec-POMDPs.

The second strategy is to cast the problem of maximization over the CGBG's factor graph into a (distributed)
constraint optimization problem ((D)COP) (Modi et al., 2005). As such, any algorithm, exact or
approximate, for (D)COPs can be used to find a solution for the CGBG (Liu and Sycara, 1995; Yokoo, 2001;
Modi et al., 2005; Pearce and Tambe, 2007; Marinescu and Dechter, 2009). Oliehoek et al. (2010) propose a
heuristic search algorithm for CBGs that uses this approach and exploits type independence (additivity of the
value function) at the level of joint types. Kumar and Zilberstein (2010) apply state-of-the-art methods for
weighted constraint satisfaction problems to instances of CBGs in the context of solving Dec-POMDPs. The
Dec-POMDPs are solved backwards using dynamic programming, resulting in CBGs with few types but many
actions. This approach exploits type independence but has been tested only as part of a Dec-POMDP solution
method. Thus, our results in Sec. 4.4 provide additional confirmation of the advantage of exploiting type
independence in Dec-POMDPs, while our results in Sec. 3.5 isolate and quantify this advantage in individual
CGBGs. Furthermore, the approach presented in this article differs from both these alternatives in that it makes
the use of type independence explicit and simultaneously exploits agent independence as well.

Finally, our work is also related to the framework of action-graph games (AGGs), which was recently
extended to handle imperfect information and can model any BG (Jiang and Leyton-Brown, 2010). This work
proposes two solution methods for general-sum Bayesian AGGs (BAGGs): the Govindan-Wilson algorithm and
simplicial subdivision. Both involve computation of the expected payoff of each agent given a current profile as
a key step in an inner loop of the algorithm. Jiang and Leyton-Brown (2010) show how this expected payoff can
be computed efficiently for each (agent, type)-pair, thereby exploiting type (and possibly agent) independence
in this inner loop. As such, this approach may compute a sample Nash equilibrium more efficiently than without
using this structure.

On the one hand, BAGGs are more general than CGBGs since they additionally allow representation of 
context-specific independence and anonymity. Furthermore, the solution method is more general since it works 
for general-sum games. On the other hand, in the context of collaborative games, a sample Nash equilibrium 
is not guaranteed to be a PONE (but only a local optimum). In contrast, we solve for the global optimum 
and thus a PONE. In addition, their approach does not exploit the synergy that independence brings in the 
identical payoff setting. In contrast, NDP and Max-Plus, by operating directly on the factor graph, exploit 
independence not just within an inner loop but throughout the computation of the solution. Finally, note that our 
algorithms also work for collaborative BAGGs that possess the same form of structure as CGBGs (i.e., agent 
and type independence). In cases where there is no anonymity or context-specific independence (e.g., as in the
CGBGs generated for Dec-POMDPs), the BAGG framework offers no advantages. 



6 Future Work 



The work presented in this article opens several interesting avenues for future work. A straightforward extension
of our methods would replace Max-Plus with more recent message passing algorithms for belief propagation
that are guaranteed to converge (Globerson and Jaakkola, 2008). Since these algorithms are guaranteed to
compute the exact MAP configuration for perfect graphs with binary variables (Jebara, 2009), the resulting
approach would be able to efficiently compute optimal solutions for CGBGs with two actions and perfect
interaction graphs.

Another avenue would be to investigate whether our algorithms can be extended to work on a broader
class of collaborative Bayesian AGGs. Doing so could enable our approach to also exploit context-specific
independence and anonymity. A complementary idea is to extend our algorithms to the non-collaborative case
by rephrasing the task of finding a sample Nash equilibrium as one of minimizing regret, as suggested by
Vickrey and Koller (2002).



Our approach to solving Dec-POMDPs with CGBGs could be integrated with methods for clustering histories
(Emery-Montemerlo et al., 2005; Oliehoek et al., 2009) to allow scaling to larger horizons. In addition,
there is great potential for further improvement in the accuracy and efficiency of computing approximate value
functions. In particular, the transfer planning approach could be extended to transfer tasks with different action
and/or observation spaces, as done in transfer learning (Taylor et al., 2007). Furthermore, it may be possible
to automatically identify suitable source tasks and mappings between tasks, e.g., using qualitative DBNs
(Liu and Stone, 2006).



7 Conclusions 

In this article, we considered the interaction of several agents under uncertainty. In particular, we focused 
on settings in which multiple collaborative agents, each possessing some private information, must coordinate 
their actions. Such settings can be formalized by the Bayesian game framework. We presented an overview of 
game-theoretic models used for collaborative decision making and delineated two different types of structure 
in collaborative games: 1) agent independence, and 2) type independence. 

Subsequently, we proposed the collaborative graphical Bayesian game (CGBG) as a model that facilitates 
more efficient decision making by decomposing the global payoff function as the sum of local payoff functions 
that depend on only a few agents. We showed how CGBGs can be represented as factor graphs (FGs) that 
capture both agent and type independence. Since a maximizing configuration of the factor graph corresponds 
to a solution of the CGBG, this representation also makes it possible to effectively exploit this independence. 
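The correspondence between maximizing configurations and CGBG solutions can be illustrated with a brute-force sketch. It is not NDP or Max-Plus: it simply enumerates all assignments of actions to (agent, type) variables and evaluates the sum of local contributions. The data layout (`Factor` as a tuple of agents, local type probabilities and a local payoff table) and all names are assumptions made for this illustration.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

# Hypothetical encoding of one local payoff component e:
#   agents : agents in the scope of e
#   prob   : local joint type -> probability
#   payoff : (local joint type, local joint action) -> local payoff
Factor = Tuple[Tuple[int, ...], Dict[tuple, float], Dict[tuple, float]]
Policy = Dict[Tuple[int, int], int]  # (agent, type) -> action


def cgbg_value(policy: Policy, factors: Sequence[Factor]) -> float:
    """Sum, over components and their local joint types, of probability times payoff."""
    total = 0.0
    for agents, prob, payoff in factors:
        for types, p in prob.items():
            actions = tuple(policy[(i, t)] for i, t in zip(agents, types))
            total += p * payoff[(types, actions)]
    return total


def brute_force_solve(factors: Sequence[Factor], actions: Sequence[int]):
    """Try every assignment of an action to each (agent, type) variable."""
    variables = sorted({(i, t) for agents, prob, _ in factors
                        for types in prob for i, t in zip(agents, types)})
    best_policy, best_value = None, float("-inf")
    for choice in product(actions, repeat=len(variables)):
        policy = dict(zip(variables, choice))
        value = cgbg_value(policy, factors)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value


# Toy game (an assumption): two agents, two types each, uniform type distribution;
# the team is rewarded only when each agent's action equals its own type.
types = (0, 1)
prob = {(t0, t1): 0.25 for t0 in types for t1 in types}
payoff = {((t0, t1), (a0, a1)): float(a0 == t0 and a1 == t1)
          for t0 in types for t1 in types for a0 in (0, 1) for a1 in (0, 1)}
policy, value = brute_force_solve([((0, 1), prob, payoff)], actions=(0, 1))
```

The enumeration is exponential in the number of (agent, type) variables, which is exactly the cost that exploiting agent and type independence in the factor graph is meant to avoid.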

We considered two solution methods: non-serial dynamic programming (NDP) and Max-Plus message 
passing. The former has a computational complexity that is exponential in the induced tree width of the FG, 
which we proved to be exponential in the number of individual types. The latter is tractable when there is 
enough independence between agents; we showed that it is exponential only in k, the maximum number of 
agents that participate in the same local payoff function. An empirical evaluation showed that exploiting both 
agent and type independence can lead to a large performance increase, compared to exploiting just one
form of independence, without sacrificing solution quality. For example, the experiments showed that this 
approach allows for the solution of coordination problems with imperfect information for up to 750 agents, 
limited only by a 1GB memory constraint. 

We also showed that CGBGs and their solution methods provide a key missing component in the approx- 
imate solution of Dec-POMDPs with many agents. In particular, we proposed FACTORED FSPC, which ap- 
proximately solves Dec-POMDPs by representing them as a series of CGBGs. To estimate the payoff functions 
of these CGBGs, we computed approximate factored value functions given predetermined scope structures via 
a method we call transfer planning. It uses value functions for smaller source problems as components of the 
factored Q-value function for the original target problem. An empirical evaluation showed that FACTORED 
FSPC significantly outperforms state-of-the-art methods for solving Dec-POMDPs with more than two agents 
and scales well with respect to the number of agents. In particular, FACTORED FSPC found (near-)optimal 
solutions on problem instances for which the optimum can be computed. For larger problem instances it found 
solutions as good as or better than comparison Dec-POMDP methods in almost all cases and in all cases out- 
performed the baselines. The most salient result from our experimental evaluation is that the proposed method 
is able to compute solutions for problems that cannot be tackled by any other methods at all (not even the base- 
lines). In particular, it found good solutions for up to 1000 agents, where previously only problems with small 
to moderate numbers of agents (up to 20) had been tackled. 
8 Appendix

Lemma 8.1 (Complexity of Max-Plus). The complexity of one iteration of Max-Plus is

$$O(m^k \cdot k^2 \cdot l \cdot F), \tag{8.1}$$

where F is the number of factors, V is the number of variables, the maximum degree of a factor is k, the
maximum degree of a variable is l, and the maximum number of values a variable can take is m.

Proof. We can directly derive that the number of edges is bounded by $e = F \cdot k$. Messages sent by a variable
are constructed by summing over incoming messages. As a variable has (on average) $l = e/V$ neighbors, this
involves adding (on average) $l - 1$ incoming messages of size m. The cost of constructing one message for one
variable therefore is $O(m \cdot (l-1))$. This means that the total cost of constructing all $e = O(F \cdot k)$ messages sent
by variables, one over each edge, is

$$O(m \cdot (l-1) \cdot F \cdot k). \tag{8.2}$$

Now we consider the messages sent by factors. Recall that the maximum size of a factor is $m^k$. The
construction of each message entails factor-message addition (see Oliehoek (2010), Sec. 5.5.3) with $k - 1$
incoming messages, each of which has cost $O(m^k)$. This leads to a cost of

$$O((k-1) \cdot m^k) = O(k \cdot m^k)$$

per factor message, and a total cost of

$$O(F \cdot k^2 \cdot m^k). \tag{8.3}$$

The complexity of a single iteration of Max-Plus is the sum of (8.2) and (8.3), which can be reduced to
(8.1). □
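The final reduction step can be spelled out as follows. This is a sketch under the assumptions $m \ge 1$, $k \ge 1$ and $l \ge 1$, which the lemma leaves implicit; each of the two terms is bounded separately by the expression in (8.1).

```latex
\begin{align*}
m \cdot (l-1) \cdot F \cdot k &\le m^k \cdot k^2 \cdot l \cdot F
  && \text{since } m \le m^k,\ l-1 \le l,\ k \le k^2, \\
F \cdot k^2 \cdot m^k &\le m^k \cdot k^2 \cdot l \cdot F
  && \text{since } l \ge 1, \\
m \cdot (l-1) \cdot F \cdot k + F \cdot k^2 \cdot m^k
  &\le 2 \cdot m^k \cdot k^2 \cdot l \cdot F
  = O(m^k \cdot k^2 \cdot l \cdot F).
\end{align*}
```

The bound is dominated by the factor-to-variable messages, which is why the cost of an iteration is exponential only in k.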